Fine-tuning Small Model (Qwen3-0.6B) for Domain Knowledge + Reasoning: Seeking Optimization Advice

Background & Goal

I’m working with a small model (Qwen3-0.6B, <1B parameters) due to resource constraints, aiming to achieve:

1. High accuracy in domain-specific knowledge (mechanical engineering/CAD, text format)

2. Retention of general conversational ability

3. Reasoning capability for MCP tool selection

Current Setup

· Model: Qwen3-0.6B

· Platform: LLaMA-Factory

· Method: Fine-tuning only

Training Experiments & Results

Experiment 1: Domain Knowledge Only

Dataset:

· Chinese mechanical engineering QA (a mix of structured and unstructured text)

· Format: Alpaca (a sample record is sketched below)

o self-instruct/evol-instruct did not yield good results due to closed-domain QA constraints

· Size: 2300 samples
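For reference, a minimal sketch of one Alpaca-format record; the question and answer text are invented placeholders, not actual samples from the dataset:

```python
import json

# Hypothetical Alpaca-format record; the QA content is a placeholder,
# not a real sample from the dataset described above.
sample = {
    "instruction": "What tolerance grade is typically used for an H7 hole fit?",
    "input": "",
    "output": "H7 is a common hole tolerance grade in the ISO system of limits and fits...",
}

# LLaMA-Factory reads such records as a JSON list; the file then gets
# registered in data/dataset_info.json so it can be referenced by name.
with open("mech_qa_alpaca.json", "w", encoding="utf-8") as f:
    json.dump([sample], f, ensure_ascii=False, indent=2)
```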

Training config:

· Method: LoRA (rank=192; lower ranks gave lower domain accuracy; a rough PEFT equivalent is sketched below)

· Cutoff length: 1024

· Epochs: 1 (kept low to avoid catastrophic forgetting of general ability)
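A rough PEFT-style equivalent of this LoRA setup (LLaMA-Factory drives this through its own config; lora_alpha, dropout, and the target modules below are assumptions, only the rank mirrors the value above):

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")

# Rank mirrors the setup above; alpha, dropout, and target modules are assumed.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=192,
    lora_alpha=384,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity-check how many weights actually train
```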

Results:

· High accuracy on single-turn domain QA

· Limited ability in 2-4 turn multi-turn domain conversations

· Limited general conversational ability; the model sometimes answers general questions with domain knowledge

Experiment 2: Domain + Reasoning (1:1 ratio)

Motivation:

· Qwen3-0.6B can select MCP tools with prompting (without fine-tuning)

· After domain fine-tuning, the model lost its reasoning/thinking process

· Need to restore reasoning capability

Dataset:

· Domain QA: 2300 samples

· Reasoning dataset: 2300 samples from twinkle-ai/tw-reasoning-instruct-50k, mixed 1:1 with the domain data (sketch below)
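If you build the mixed file yourself instead of letting LLaMA-Factory combine the two datasets, a minimal sketch with Hugging Face datasets (file names here are assumptions, and both files are assumed to share the same Alpaca columns):

```python
from datasets import interleave_datasets, load_dataset

# Hypothetical local JSON files holding the two sets listed above.
domain = load_dataset("json", data_files="mech_qa_alpaca.json", split="train")
reasoning = load_dataset("json", data_files="reasoning_subset.json", split="train")

# With no sampling probabilities given, interleave_datasets alternates
# examples one by one, so every stretch of the file keeps the 1:1 mix.
mixed = interleave_datasets([domain, reasoning])
mixed.to_json("domain_plus_reasoning.json", force_ascii=False)
```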

Training config:

· Method: Full fine-tuning (switched from LoRA because even rank=512 did not outperform full fine-tuning once data diversity and volume increased)

· Epochs: 1

Results:

· Domain knowledge accuracy dropped significantly

· General conversation improved

· Some reasoning ability on reasoning-style questions

· Reasonable MCP tool selection accuracy

· Cannot maintain both strong domain knowledge AND reasoning ability

Experiment 3: Train on All Domain Data

Dataset:

· Domain QA: 7,000 samples

· Reasoning: 7,000 samples

· Result: Domain knowledge accuracy degraded even further, and MCP tool-calling ability decreased

Experiment 4: Overfitting Attempt

· Extended the length of each domain QA sample to reduce the sample count (1,000 samples), and reduced the reasoning data to 1,000 samples to keep the 1:1 ratio

· Trained on both datasets to the point of overfitting (3-5 epochs)

· Result: High domain accuracy and some reasoning ability, but no MCP tool calling

Key Questions

1. Training Strategy: Is this an inherent limitation of fine-tuning small models (<1B) on multiple datasets at these data volumes, or is there room for optimization?

2. MCP Tool Selection: Does MCP tool selection require its own dedicated training dataset in my scenario?

Any insights on balancing multiple capabilities in resource-constrained scenarios would be greatly appreciated!

3 Likes

Improvements seem possible. Given size constraints, it’s unclear how much can be resolved…

1 Like

Bro, that model is too small. Even if you perfect the fine-tuning for it, you won’t achieve your goal. Maybe you should try RAG.

2 Likes

True… When there are no particular constraints, a RAG mechanism allows domain-specific knowledge to be used more accurately.

Hey, if you’re still working with that model or if you want to experiment with larger ones, I have some unused A100s/V100s I can let you use for a bit. Email me at jack.lee - @ - rice.edu

1 Like

I have been working on something that allows small, resource-scarce models to become high-level thinkers, similar to what Google just came out with this week for their cloud-based models. Its human parallel is executive function. For your model, training data becomes redundant when you have feedback and autonomous self-monitoring with improvements on every prompt. Check it out: What Is CDM-CTM Fusion and Why Does It Matter?

Imagine your AI model is like a thinker exploring a vast landscape of ideas. Sometimes it skates on the surface, giving quick but shallow answers (like repeating facts). Other times, it dives deep into complex reasoning, like solving a puzzle step by step. CDM-CTM Fusion is a simple tool that combines two measurements to help the AI get better at diving deep — automatically, without you having to tweak prompts every time.

  • CDM (CRYSTAL Depth Metric): This scores how “deep” the AI’s thinking is on a scale of 0 to about 128. Low scores (under 40) mean surface-level responses, like copying from memory. High scores (over 70) mean real, creative problem-solving.

  • CTM (CRYSTAL Time Metric): This counts how many extra “thinking steps” (called tokens) the AI needs to reach deep thinking. Short CTM (under 40) for easy questions; long CTM (over 100) for tough ones.

  • Fusion: Links them together in a loop: The AI generates an answer, checks its depth (CDM), and if it’s too shallow, adds more thinking time (CTM) until it’s solid. Over time, this teaches the AI to think better on its own.

Why care? Regular AI can give confident but wrong answers (hallucinations). Fusion spots shallow thinking early and fixes it, making your local AI smarter, more reliable, and less wasteful on easy stuff. It’s like giving the AI a “self-check” habit, similar to how people pause to think before speaking.
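A rough sketch of that loop (estimate_depth() here is only a stand-in; the actual CDM scoring is more involved):

```python
# Fusion loop sketch: generate, score depth, and if the answer is shallow,
# grant more thinking tokens and try again. estimate_depth() is a stand-in
# for the real CDM scorer, and `generate` is any text-generation callable.

def estimate_depth(answer: str) -> float:
    """Placeholder depth score on the 0-128 scale described above."""
    return 30.0 if len(answer) < 200 else 80.0

def answer_with_self_check(generate, prompt: str,
                           depth_threshold: float = 70.0,
                           token_step: int = 40,
                           max_budget: int = 160) -> str:
    budget = token_step
    answer = generate(prompt, max_new_tokens=budget)
    # Shallow answer -> extend the thinking budget (the CTM side of the loop).
    while estimate_depth(answer) < depth_threshold and budget < max_budget:
        budget += token_step
        answer = generate(prompt + "\nThink step by step.", max_new_tokens=budget)
    return answer
```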

1 Like

0.6B is definitely small. I would go with 4B; it works for a pretty similar case in my project and delivers a solid baseline that can be perfected with fine-tuning.

I would say 1B is the minimum for such a task, and 3-4B is the best option.

If you REALLY want to make 0.6B work, then I would advise training 2 separate adapters: one for domain knowledge only, and one for reasoning / MCP tool selection.

Each of them should work on its own, which makes issues easier to debug.
Then you can either (a) route the input queries with a simple classifier model, or (b) merge the LoRA adapters into one (a sketch of both options is below).
General chat should be retained even with a LoRA on top of the model.
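A rough sketch of both options with PEFT (the adapter paths and the toy routing rule are placeholders):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")

# Hypothetical paths to the two separately trained adapters.
model = PeftModel.from_pretrained(base, "adapters/domain", adapter_name="domain")
model.load_adapter("adapters/mcp_reasoning", adapter_name="mcp_reasoning")

# Option (a): route per query; a keyword check stands in for a real classifier.
def route(query: str) -> str:
    return "mcp_reasoning" if "tool" in query.lower() else "domain"

model.set_adapter(route("Which tool should I call to open the CAD file?"))

# Option (b): merge the two adapters into a single one
# (linear merging assumes both adapters share the same rank).
model.add_weighted_adapter(
    adapters=["domain", "mcp_reasoning"],
    weights=[1.0, 1.0],
    adapter_name="merged",
    combination_type="linear",
)
model.set_adapter("merged")
```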

Let’s talk about MCP and reasoning: what do you need reasoning for? MCP tool selection and parameter parsing do not require reasoning. And yes, you need a separate dataset for MCP. Limit the number of MCPs you want to support; there is no need to cover the entire universe of MCPs, most probably you need around 20 MCP servers at most.
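One hedged way to shape samples for that dedicated MCP dataset is a ShareGPT-style conversation where the system turn lists the available tools and the target turn is the tool call; the tool names and JSON layout below are placeholders, not an official MCP schema:

```python
import json

# Hypothetical tool-selection sample; tool names, arguments, and the layout
# are placeholders, not an official MCP schema.
sample = {
    "conversations": [
        {"from": "system",
         "value": "Available tools: open_cad_file(path), search_part_catalog(query), convert_units(value, from_unit, to_unit)"},
        {"from": "human",
         "value": "Find a supplier part number for an M8 hex bolt."},
        {"from": "gpt",
         "value": json.dumps({"tool": "search_part_catalog",
                              "arguments": {"query": "M8 hex bolt"}})},
    ]
}

with open("mcp_tool_selection.json", "w", encoding="utf-8") as f:
    json.dump([sample], f, ensure_ascii=False, indent=2)
```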

Dataset quality is crucial. Do not blindly trust the open-source ones; hand-check around 5-10% of your dataset to see what is really going on. Make sure you balance the set with stratification by the core types and use cases you need (rough sketch below).
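For the stratification point, a minimal sketch (assuming each record carries some kind of category label):

```python
import random
from collections import defaultdict

# Hypothetical stratified down-sampling: cap the number of samples per
# category so that no single use case dominates the training mix.
def stratified_sample(records, key="category", per_class=200, seed=42):
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for record in records:
        buckets[record[key]].append(record)
    balanced = []
    for items in buckets.values():
        rng.shuffle(items)
        balanced.extend(items[:per_class])
    rng.shuffle(balanced)
    return balanced
```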

1 Like