Fine-tuning Small Model (Qwen3-0.6B) for Domain Knowledge + Reasoning: Seeking Optimization Advice

Background & Goal

I’m working with a small model (Qwen3-0.6B, <1B parameters) due to resource constraints, aiming to achieve:

1. High accuracy in domain-specific knowledge (mechanical engineering/CAD, text format)

2. Retention of general conversational ability

3. Reasoning capability for MCP tool selection

Current Setup

· Model: Qwen3-0.6B

· Platform: LLaMA-Factory

· Method: Fine-tuning only

Training Experiments & Results

Experiment 1: Domain Knowledge Only

Dataset:

· Chinese mechanical engineering QA (a mix of structured and unstructured text)

· Format: Alpaca (a sample record is sketched below)

o self-instruct/evol-instruct did not yield good results due to closed-domain QA constraints

· Size: 2300 samples
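For reference, a minimal sketch of one Alpaca-format record; the question and answer text are invented placeholders, not actual samples from the dataset:

```python
import json

# Hypothetical Alpaca-format record; the QA content is a placeholder,
# not a real sample from the dataset described above.
sample = {
    "instruction": "What tolerance grade is typically used for an H7 hole fit?",
    "input": "",
    "output": "H7 is a common hole tolerance grade in the ISO system of limits and fits...",
}

# LLaMA-Factory reads such records as a JSON list; the file then gets
# registered in data/dataset_info.json so it can be referenced by name.
with open("mech_qa_alpaca.json", "w", encoding="utf-8") as f:
    json.dump([sample], f, ensure_ascii=False, indent=2)
```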

Training config:

· Method: LoRA (rank=192; lower ranks gave lower domain accuracy; a rough PEFT equivalent is sketched below)

· Cutoff length: 1024

· Epochs: 1 (kept low to avoid catastrophic forgetting of general ability)
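A rough PEFT-style equivalent of this LoRA setup (LLaMA-Factory drives this through its own config; lora_alpha, dropout, and the target modules below are assumptions, only the rank mirrors the value above):

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")

# Rank mirrors the setup above; alpha, dropout, and target modules are assumed.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=192,
    lora_alpha=384,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity-check how many weights actually train
```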

Results:

· High accuracy on single-turn domain QA

· Limited ability in 2-4 turn multi-turn domain conversations

· Limited general conversational ability; the model sometimes answers general questions with domain knowledge

Experiment 2: Domain + Reasoning (1:1 ratio)

Motivation:

· Qwen3-0.6B can select MCP tools with prompting (without fine-tuning)

· After domain fine-tuning, the model lost its reasoning/thinking process

· Need to restore reasoning capability

Dataset:

· Domain QA: 2300 samples

· Reasoning dataset: 2300 samples from twinkle-ai/tw-reasoning-instruct-50k, mixed 1:1 with the domain data (sketch below)
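If you build the mixed file yourself instead of letting LLaMA-Factory combine the two datasets, a minimal sketch with Hugging Face datasets (file names here are assumptions, and both files are assumed to share the same Alpaca columns):

```python
from datasets import interleave_datasets, load_dataset

# Hypothetical local JSON files holding the two sets listed above.
domain = load_dataset("json", data_files="mech_qa_alpaca.json", split="train")
reasoning = load_dataset("json", data_files="reasoning_subset.json", split="train")

# With no sampling probabilities given, interleave_datasets alternates
# examples one by one, so every stretch of the file keeps the 1:1 mix.
mixed = interleave_datasets([domain, reasoning])
mixed.to_json("domain_plus_reasoning.json", force_ascii=False)
```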

Training config:

· Method: Full fine-tuning (switched from LoRA because even rank=512 did not outperform full fine-tuning once data diversity and volume increased)

· Epochs: 1

Results:

· Domain knowledge accuracy dropped significantly

· General conversation improved

· Some reasoning ability on reasoning-style questions

· Reasonable MCP tool selection accuracy

· Cannot maintain both strong domain knowledge AND reasoning ability

Experiment 3: Train on All Domain Data

Dataset:

· Domain QA: 7,000 samples

· Reasoning: 7,000 samples

· Result: Domain knowledge accuracy degraded even further, and MCP tool-calling ability decreased

Experiment 4: Overfitting Attempt

· Extended the length of each domain QA sample to reduce the sample count (1,000 samples), and reduced the reasoning data to 1,000 samples to keep the 1:1 ratio

· Trained on both datasets to the point of overfitting (3-5 epochs)

· Result: High domain accuracy and some reasoning ability, but no MCP tool calling

Key Questions

1. Training Strategy: Is this an inherent limitation of fine-tuning small models (<1B) on multiple datasets at these data volumes, or is there room for optimization?

2. MCP Tool Selection: Does MCP tool selection require its own dedicated training dataset in my scenario?

Any insights on balancing multiple capabilities in resource-constrained scenarios would be greatly appreciated!

3 Likes

Improvements seem possible. Given size constraints, it’s unclear how much can be resolved…

1 Like

Bro, that model is too small. Even if you perfect the fine-tuning for it, you won’t achieve your goal. Maybe you should try RAG.

2 Likes

True… When there are no particular constraints, a RAG mechanism allows domain-specific knowledge to be used more accurately.

Hey, if you’re still working with that model or if you want to experiment with larger ones, I have some unused A100s/V100s I can let you use for a bit. Email me at jack.lee - @ - rice.edu

1 Like

I have been working on something that allows small, resource-scarce models to become high-level thinkers, similar to what Google just came out with this week for their cloud-based models. Its human parallel is executive function. For your model, training data becomes redundant when you have feedback and autonomous self-monitoring with improvements on every prompt. Check it out: What Is CDM-CTM Fusion and Why Does It Matter?

Imagine your AI model is like a thinker exploring a vast landscape of ideas. Sometimes it skates on the surface, giving quick but shallow answers (like repeating facts). Other times, it dives deep into complex reasoning, like solving a puzzle step by step. CDM-CTM Fusion is a simple tool that combines two measurements to help the AI get better at diving deep — automatically, without you having to tweak prompts every time.

  • CDM (CRYSTAL Depth Metric): This scores how “deep” the AI’s thinking is on a scale of 0 to about 128. Low scores (under 40) mean surface-level responses, like copying from memory. High scores (over 70) mean real, creative problem-solving.

  • CTM (CRYSTAL Time Metric): This counts how many extra “thinking steps” (called tokens) the AI needs to reach deep thinking. Short CTM (under 40) for easy questions; long CTM (over 100) for tough ones.

  • Fusion: Links them together in a loop: The AI generates an answer, checks its depth (CDM), and if it’s too shallow, adds more thinking time (CTM) until it’s solid. Over time, this teaches the AI to think better on its own.

Why care? Regular AI can give confident but wrong answers (hallucinations). Fusion spots shallow thinking early and fixes it, making your local AI smarter, more reliable, and less wasteful on easy stuff. It’s like giving the AI a “self-check” habit, similar to how people pause to think before speaking.
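A rough sketch of that loop (estimate_depth() here is only a stand-in; the actual CDM scoring is more involved):

```python
# Fusion loop sketch: generate, score depth, and if the answer is shallow,
# grant more thinking tokens and try again. estimate_depth() is a stand-in
# for the real CDM scorer, and `generate` is any text-generation callable.

def estimate_depth(answer: str) -> float:
    """Placeholder depth score on the 0-128 scale described above."""
    return 30.0 if len(answer) < 200 else 80.0

def answer_with_self_check(generate, prompt: str,
                           depth_threshold: float = 70.0,
                           token_step: int = 40,
                           max_budget: int = 160) -> str:
    budget = token_step
    answer = generate(prompt, max_new_tokens=budget)
    # Shallow answer -> extend the thinking budget (the CTM side of the loop).
    while estimate_depth(answer) < depth_threshold and budget < max_budget:
        budget += token_step
        answer = generate(prompt + "\nThink step by step.", max_new_tokens=budget)
    return answer
```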

1 Like

0.6B is definitely small. I would go with 4B; it works for a pretty similar case in my project and delivers a solid baseline that can be perfected with fine-tuning.

I would say 1B is the minimum for such a task, and 3-4B is the best option.

If you REALLY want to make 0.6B work, then I would advise training 2 separate adapters: one for domain knowledge only, and one for reasoning / MCP tool selection.

Each of them should work on its own, which makes issues easier to debug.
Then you can either (a) route the input queries with a simple classifier model, or (b) merge the LoRA adapters into one (a sketch of both options is below).
General chat should be retained even with a LoRA on top of the model.
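A rough sketch of both options with PEFT (the adapter paths and the toy routing rule are placeholders):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")

# Hypothetical paths to the two separately trained adapters.
model = PeftModel.from_pretrained(base, "adapters/domain", adapter_name="domain")
model.load_adapter("adapters/mcp_reasoning", adapter_name="mcp_reasoning")

# Option (a): route per query; a keyword check stands in for a real classifier.
def route(query: str) -> str:
    return "mcp_reasoning" if "tool" in query.lower() else "domain"

model.set_adapter(route("Which tool should I call to open the CAD file?"))

# Option (b): merge the two adapters into a single one
# (linear merging assumes both adapters share the same rank).
model.add_weighted_adapter(
    adapters=["domain", "mcp_reasoning"],
    weights=[1.0, 1.0],
    adapter_name="merged",
    combination_type="linear",
)
model.set_adapter("merged")
```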

Let’s talk about MCP and reasoning: what do you need reasoning for? MCP tool selection and parameter parsing do not require reasoning. And yes, you need a separate dataset for MCP. Limit the number of MCPs you want to support; there is no need to cover the entire universe of MCPs, most probably you need around 20 MCP servers at most.
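One hedged way to shape samples for that dedicated MCP dataset is a ShareGPT-style conversation where the system turn lists the available tools and the target turn is the tool call; the tool names and JSON layout below are placeholders, not an official MCP schema:

```python
import json

# Hypothetical tool-selection sample; tool names, arguments, and the layout
# are placeholders, not an official MCP schema.
sample = {
    "conversations": [
        {"from": "system",
         "value": "Available tools: open_cad_file(path), search_part_catalog(query), convert_units(value, from_unit, to_unit)"},
        {"from": "human",
         "value": "Find a supplier part number for an M8 hex bolt."},
        {"from": "gpt",
         "value": json.dumps({"tool": "search_part_catalog",
                              "arguments": {"query": "M8 hex bolt"}})},
    ]
}

with open("mcp_tool_selection.json", "w", encoding="utf-8") as f:
    json.dump([sample], f, ensure_ascii=False, indent=2)
```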

Dataset quality is crucial. Do not blindly trust the open-source ones; hand-check around 5-10% of your dataset to see what is really going on. Make sure you balance the set with stratification by the core types and use cases you need (rough sketch below).
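For the stratification point, a minimal sketch (assuming each record carries some kind of category label):

```python
import random
from collections import defaultdict

# Hypothetical stratified down-sampling: cap the number of samples per
# category so that no single use case dominates the training mix.
def stratified_sample(records, key="category", per_class=200, seed=42):
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for record in records:
        buckets[record[key]].append(record)
    balanced = []
    for items in buckets.values():
        rng.shuffle(items)
        balanced.extend(items[:per_class])
    rng.shuffle(balanced)
    return balanced
```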

1 Like