SFT of Instruct Model Collapse

Has anyone else observed models collapsing when you try to SFT an Instruct model?
This is not the first time I’m seeing this - I was trying to train a model (which I thought was a foundation model) and by the end it was generating absolute gibberish and not even reaching the eos token. Later I found out that the model, even though it was described as ONLY pretrained, turned out to already be familiar with a chat template and instruction following.
My question - is it a rule that if we try to SFT a model that is already an Instruct model, it will collapse in the end?


Even when you run SFT on an Instruct model, you usually still end up with normal, working model weights.

However, compared to a base model, an Instruct model often has its own quirks around special tokens and chat templates, and if your training setup doesn’t reproduce them, the result can look broken. Also note that older models may lack chat templates altogether or ship with inappropriate ones.


Short, direct answer:
No, it is not a rule that SFT on an already-Instruct model will “collapse” into gibberish. Lots of people successfully SFT Instruct/chat models every day. What does happen quite often is that SFT pipelines that are slightly wrong break Instruct models much more visibly than base models.

Your symptom:

  • “absolute gibberish”
  • “not even getting to eos token”

matches technical / configuration problems, not a fundamental “you SFTed an Instruct model, therefore it died” law.

Below I’ll put your experience in context, then connect it to concrete online reports and the real failure modes.


1. Background: what you did and why the model reacted badly

You thought you were fine-tuning a pretrained/base model.
In reality, the checkpoint was already instruction-tuned and chat-templated.

That matters because:

  1. A base model:

    • has no special chat tokens,
    • has no strong expectations about formatting,
    • and is relatively tolerant of you just feeding it “text in, text out.”
  2. An Instruct/chat model usually:

    • expects a very specific chat template (<|user|> ... <|assistant|> ... <eos> or similar),
    • has special tokens baked into its vocabulary and training,
    • may have been trained with assistant-only loss on top of that format.

If you fine-tune it as if it were a plain LM (no template, wrong special tokens, wrong EOS handling, wrong masking, wrong tokenizer), you can:

  • distort the probability distribution over tokens so that:

    • language degenerates into nonsense,
    • EOS is never produced,
  • and/or break things at save/load or deployment time so the weights no longer match the tokenizer.

That looks like “the model collapsed because it was Instruct,” but the underlying cause is: you violated the assumptions that Instruct model was trained under.
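
To make that concrete, here is a minimal sketch (my illustration, not something from the original thread) of the difference between the text an Instruct checkpoint was trained on and the raw concatenation a plain-LM pipeline might feed it; the model ID and messages are placeholders:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")  # any Instruct checkpoint

messages = [
    {"role": "user", "content": "Summarize: the cat sat on the mat."},
    {"role": "assistant", "content": "A cat sat on a mat."},
]

# What the Instruct model was trained on: its own chat template, with role
# markers and the turn-ending special token added by the template itself.
templated = tokenizer.apply_chat_template(messages, tokenize=False)

# What a "plain LM" SFT pipeline might feed it instead: raw concatenated text
# with no role markers and no EOS handling.
raw = messages[0]["content"] + "\n" + messages[1]["content"]

print(templated)
print(raw)

Training on the second form while inference still uses the first is exactly the kind of mismatch described above.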


2. Has anyone else seen this? Yes – a lot. But the causes are concrete.

Here are closely matching real reports.

2.1 Gemma-3-1B-IT: Instruct model → gibberish after instruction SFT

On the official Gemma 3 Instruct (google/gemma-3-1b-it), a user fine-tunes for instruction tasks and gets gibberish outputs. Google’s response:

  • likely mismatch between tokenizer and chat template used in fine-tuning vs what the Instruct model expects,
  • plus possible problems with fp16 numerics or data formatting. (Hugging Face)

They explicitly note that the base Gemma model (no chat special tokens) didn’t break under the same pipeline, exactly like your experience.

So: same pattern you saw, and the root cause is format/tokenizer mismatch, not a general rule that “instruct models collapse.”


2.2 Llama-2-7B-chat + softprompt: base OK, Instruct+prompt = pure gibberish

In a Hugging Face forum thread “Softprompt for Llama generating gibberish output”, a user:

  • trains a soft prompt on top of meta-llama/Llama-2-7b-chat-hf,
  • base chat model alone is fine,
  • but with the softprompt attached, generations are nonsensical character salad. (Hugging Face Forums)

The diagnosis revolves around:

  • mis-handling of input formatting with the softprompt,
  • incorrect placement of the prompt embeddings,
  • and mismatched tokenization.

Again, it’s an Instruct model, and it collapses only when the additional SFT / PEFT layer is wired wrong.


2.3 GPT-2 persona-chat: base LM → fine-tune → weird chat gibberish

Older, non-instruct example, but exact same symptom:

  • “Fine tuning GPT2 on persona chat dataset outputs gibberish” (HF forums). (Hugging Face Forums)

  • Training loss looks reasonable, but responses are a mess.

  • Discussion points to:

    • how dialogue is concatenated,
    • tokenization,
    • truncation and decoding config.

This shows: you can get the same kind of collapse without any Instruct model at all if you mishandle the data/format.


2.4 Gemma-3-1B-PT: fine-tuned model OK, but gibberish after quantization

Another closely related pattern:

  • User fine-tunes google/gemma-3-1b-pt (pretrained/base) with Unsloth + LoRA using ChatML.
  • Full-precision merged model responds correctly.
  • After GPTQ/AWQ/BitsAndBytes quantization, all quantized versions produce gibberish or empty outputs. (Hugging Face)

Here the SFT is fine, but the deployment step (quantization / conversion) breaks the model, making it look collapsed.


2.5 BART: fine-tune → gibberish until decoder is fixed

Seq2seq but the same underlying pattern:

  • “What can cause model.generate (BART) output to be gibberish after fine-tuning?” (HF forums). (Hugging Face Forums)

  • After fine-tuning, generate() gives nonsense.

  • Official answer: their decoder inputs were wrong:

    • not shifting decoder labels,
    • not using correct decoder_start_token_id.
  • Once they fix decoding, outputs become normal.

Again: the training isn’t inherently doomed; the mechanics were off.


3. What’s actually going on when you see “gibberish + no EOS”

Based on those reports and your description, the failure is almost always one (or a stack) of these:

  1. Tokenizer / vocab / EOS mismatch

    • Using a tokenizer that doesn’t match the checkpoint.
    • Changing special tokens (e.g., adding chat tokens) and not resizing or retraining the final layer correctly.
    • Accidentally changing EOS or pad IDs so they no longer line up.
    • Result: token IDs at inference don’t match what the model learned → gibberish, EOS never sampled. (Hugging Face)
  2. Ignoring the model’s chat template

    • Instruct model expects something like user: ... assistant: ... <eos>.
    • SFT data is fed as raw text or in a different homemade format.
    • Inference still uses the “official” chat template (or vice versa).
    • Result: the model’s learned structure and your training structure clash → poor logits, nonsense, or strange termination. (Hugging Face)
  3. Over-aggressive learning rate / training schedule

    • Especially on already-aligned Instruct models, large LR or too many epochs can push weights far from a good optimum.
    • T5 fine-tuning threads show that lr=1e-4, 10 epochs is enough to turn fluent T5 into gibberish. (Hugging Face Forums)
  4. Broken save/load or quantization

    • Checkpoints saved without the right PEFT adapters,
    • LoRA merges done incorrectly,
    • conversions to GGUF/AWQ/GPTQ that scramble weights.
    • That’s exactly what you see in the Gemma-3-1B-PT quantization thread and Unsloth GGUF issues. (Hugging Face)
  5. NaNs / numerical instability

    • bf16/fp16 with unstable kernels, mis-configured FlashAttention, or bad gradient scales can silently corrupt weights.
    • The model will still run but outputs nonsense and EOS probabilities are off.

None of these are about “the model already being Instruct” as a fundamental problem. They are about technical mismatch between your pipeline and the model’s assumptions.
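
For failure mode 1 in particular, a quick sanity check (a generic sketch, not tied to any one model family; the checkpoint path is a placeholder) is to confirm that the tokenizer, model config, and generation config still agree after training and merging:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/or/hub-id-of-your-finetuned-checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# The three places EOS/PAD can silently diverge after a fine-tune or merge.
print("tokenizer eos/pad:", tokenizer.eos_token_id, tokenizer.pad_token_id)
print("model config eos/pad:", model.config.eos_token_id, model.config.pad_token_id)
print("generation config eos/pad:", model.generation_config.eos_token_id,
      model.generation_config.pad_token_id)

# Vocab size vs. embedding size; a mismatch means the weights and tokenizer
# no longer describe the same vocabulary.
print("tokenizer vocab:", len(tokenizer))
print("input embeddings:", model.get_input_embeddings().weight.shape[0])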


4. Why Instruct models feel more fragile than base ones

This is probably the core of your intuition.

  1. More structure baked in.
    Instruct/chat models are trained on very regular patterns:

    • role tokens and separators,
    • explicit EOS at turn boundaries,
    • often assistant-only loss.

    That structure becomes part of their internal distribution. If you fine-tune with a different structure (no roles, different separators, missing EOS), you’re effectively telling it “unlearn what you previously knew,” and it can go off the rails faster than a base model that never had that structure.

  2. Less “room to move” without breaking behavior.
    A base model is a general LM; you can often push it a bit and it’ll remain a fluent LM. An Instruct model has already been optimized for:

    • politeness,
    • format,
    • safety constraints.

    Aggressive SFT or wrong formatting pushes against that alignment and can degrade behavior more visibly.

  3. But they are routinely SFTed successfully.
    At the same time:

    • TRL’s SFT examples use Instruct models like Llama-3-Instruct, Gemma-it, Qwen-chat as standard starting points for domain SFT. (Hugging Face Forums)
    • There are public tutorials specifically for fine-tuning Gemma-3-1B-IT for tasks like financial sentiment, which work fine when the pipeline is correct. (Medium)

    So empirically, there is no “instruct → instant collapse” rule; there is “instruct → more sensitive to mistakes.”


5. What this means for your rule-question

Your question:

Is it a rule that if we try to SFT a model that is already an Instruct model, it will collapse in the end?

Clear answer:

  • No, there is no such rule.

  • There is a strong pattern that:

    • if you SFT an Instruct model as if it were a base model,
    • ignore its chat template,
    • mis-handle special tokens/EOS,
    • or use aggressive LR / bad save-load / bad quantization,

    then you can absolutely get hard collapse (gibberish, no EOS) like you saw.

But this is a consequence of pipeline mismatch, not an inherent property of “instructness.”


6. Practical takeaway (if you do this again)

To avoid repeating the collapse you saw:

  1. Always confirm what the checkpoint actually is.

    • Check the model card: is it *-it, *-Instruct, *-Chat, etc.?
    • If yes, assume it needs its documented chat template and special tokens.
  2. Use the official SFT pattern for that family.

    • For Gemma, follow Google/HF + TRL SFT guides. (Google AI for Developers)
    • For Llama, Qwen, etc., use tokenizer.apply_chat_template and TRL’s SFTTrainer as shown in their examples.
  3. Keep tokenizer + model bound together.

    • Same repo ID and revision for both.
    • Save and reload them together.
    • After loading, check vocab size vs LM head size.
  4. Be gentle with hyperparameters, especially on Instruct.

    • Lower LR than you’d use for a base model.
    • Fewer epochs / early stopping.
  5. Test mid-training and immediately after saving.

    • If mid-training generations look fine but the final checkpoint is broken, suspect save/load or quantization, not SFT itself.

If you line those up, SFT on an Instruct model is just another normal step in the stack (pretrain → SFT → preference optimization), not something that “must” end in collapse.
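
As a reference point, here is a minimal hedged sketch of what “lining those up” can look like with TRL; it assumes a recent TRL version where SFTTrainer accepts conversational datasets and applies the checkpoint’s chat template for you, and the model ID, data, and hyperparameters are placeholders rather than recommendations from this thread:

from datasets import Dataset
from trl import SFTTrainer, SFTConfig

# Tiny toy dataset in the conversational format SFTTrainer understands.
train_dataset = Dataset.from_list([
    {"messages": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "4"},
    ]},
])

trainer = SFTTrainer(
    model="google/gemma-3-1b-it",          # the Instruct checkpoint you actually verified
    train_dataset=train_dataset,
    args=SFTConfig(
        output_dir="sft-out",
        learning_rate=1e-5,                # gentler than a typical base-model run
        num_train_epochs=1,
        per_device_train_batch_size=1,
    ),
)
trainer.train()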


Aha! I guess I never had any luck w instruct models and managed to collapse them. Always SFTed base ones. Will check my pipelines! Thank u, @John6666


@John6666 maybe u could help out - I am trying to integrate CoT into Qwen 3B, so I fine-tuned it with LoRA, 1 epoch, 800 samples. Added special tokens for thinking. Loss seemed to be alright, I did see a decrease and training was stable; nevertheless, when testing, the model did not generate any thinking tokens and seems to have started generating trash.
The thinking tokens were added to the embed layer and trained like this:

from peft import LoraConfig

# Register the new thinking markers and grow the embedding matrix to match.
special_tokens = ["<|think_start|>", "<|think_end|>"]
tokenizer.add_special_tokens({'additional_special_tokens': special_tokens})
model.resize_token_embeddings(len(tokenizer))

# LoRA on the attention/MLP projections, plus trainable embedding rows for the
# two new tokens only.
peft_config = LoraConfig(
    trainable_token_indices={'embed_tokens': tokenizer.convert_tokens_to_ids(special_tokens)},
    r = 16,
    target_modules = ["gate_proj", "down_proj", "up_proj",
                      "q_proj", "v_proj", "k_proj", "o_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_rslora = True,
    lora_bias = False
)

Any hints as to what could be the issue?


You might want to take a closer look at the specifications for the special tokens and templates of the target model…?


High-level summary first:

  • No, your code snippet is not obviously “illegal”, and LoRA on Qwen-3B with new thinking tokens is a reasonable idea.

  • However, there are several very specific places this setup can go wrong:

    1. The new <|think_start|> / <|think_end|> tokens may not actually be trained as outputs (e.g., they never appear in labels, or generation is configured to avoid them).
    2. The chat template / prompt format may not match what you used during training, especially with Qwen’s own template logic.
    3. The dataset size and SFT config (800 samples, 1 epoch) may be too weak or too noisy to reliably teach a new behavior, and can still degrade the base model.
    4. There may be a PEFT / trainable_token_indices mismatch (versioning, tied weights, etc.), so your special tokens remain effectively random.

Below is a structured walkthrough of what’s happening, what can break, and how to debug it step by step.


1. What you tried to do (and what PEFT expects)

You did three key things:

  1. Added new CoT markers as special tokens

    special_tokens = ["<|think_start|>", "<|think_end|>"]
    tokenizer.add_special_tokens({'additional_special_tokens': special_tokens})
    model.resize_token_embeddings(len(tokenizer))
    

    This is exactly the recommended pattern when introducing <think> tokens in PEFT docs:
    you add tokens and resize embeddings before applying LoRA/PEFT. (Hugging Face)

  2. Used LoRA with trainable_token_indices

    peft_config = LoraConfig(
        trainable_token_indices={'embed_tokens': tokenizer.convert_tokens_to_ids(special_tokens)},
        r = 16,
        target_modules = ["gate_proj", "down_proj", "up_proj",
                          "q_proj", "v_proj", "k_proj", "o_proj"],
        lora_alpha = 16,
        lora_dropout = 0,
        bias = "none",
        use_rslora = True,
        lora_bias = False
    )
    

    PEFT’s LoRA docs explicitly allow this pattern: pass trainable_token_indices={'embed_tokens': [indices...]} to train only selected embedding rows alongside usual LoRA layers. (Hugging Face)

  3. Trained ~1 epoch on ~800 samples, loss went down, training stable.

So structurally you’re doing what PEFT calls “train tokens alongside LoRA” with new <think> markers, which is a supported pattern. The fact that it then doesn’t emit those tokens and degrades behavior means something else in the pipeline is off.


2. How Qwen-3 thinks about CoT internally (context)

Qwen3 “thinking” models (e.g. Qwen3-4B / Qwen3-30B Thinking variants) use dedicated thinking tokens:

  • Template for the last turn (from the Qwen docs):

    <|im_start|>user
    {user content}<|im_end|>
    <|im_start|>assistant
    <think>
    {thinking content}
    </think>
    
    {assistant content}<|im_end|>
    
    
  • When you call tokenizer.apply_chat_template(..., enable_thinking=True), the template inserts <think>...</think> around the reasoning block before the final answer. (docs.unsloth.ai)
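
For reference, a hedged sketch of how that flag is used at the prompt-building stage, assuming a Qwen3 checkpoint whose bundled chat template supports enable_thinking (the model ID is only an example):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")

messages = [{"role": "user", "content": "What is 17 * 24?"}]

# enable_thinking=True makes the template open a <think> block for the
# assistant turn; enable_thinking=False suppresses it.
prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
print(prompt)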

For training CoT in their official pipeline, they:

  • wrap reasoning content in <think>...</think>,
  • align that with the chat template,
  • and train on large math/reasoning datasets plus reasoning-aware RL. (Qwen)

You are essentially trying to replicate this behavior from scratch on Qwen-3B with custom markers <|think_start|> / <|think_end|> and a very small SFT dataset.

That is doable, but fragile: you need to get all of the following right at once:

  1. Token addition + embedding training.
  2. Data formatting and label masking.
  3. Chat template / inference format.
  4. Hyperparameters and dataset size.

If any one of those is wrong, the model can easily:

  • never emit your thinking tokens, and/or
  • degrade into “trash” outputs.

3. Main likely failure modes in your specific setup

3.1 The thinking tokens are never actually supervised as outputs

This is the most common issue in practice:

  • You see <|think_start|> / <|think_end|> strings in your raw text,

  • but after tokenization + collator, they may:

    • not be in the labels at all (e.g. masked out as part of the “prompt”), or
    • be in labels but only extremely rarely (e.g. only in the first token, or truncated).

Typical ways this happens:

  1. SFT trainer masking everything before some “response marker”

    • TRL’s SFTTrainer and similar setups can use response_template / assistant_only_loss to start loss only from the assistant’s answer marker.
    • If your <|think_start|> block is before the region you treat as the “answer”, or accidentally considered part of the “prompt”, the loss will ignore it.
    • Result: the model never learns to emit those tokens, they are only seen as input context.
  2. Custom collator masking special tokens

    • Some training code masks all special tokens from the loss.
    • If <|think_start|> / <|think_end|> are in additional_special_tokens, and your collator strips or masks all_special_ids from labels, they will never be trained.
  3. Truncation

    • If your reasoning block sits early in the sequence but you truncate aggressively at the beginning (or the opposite), it may not reach the label region.

Quick sanity checks:

  • Take one batch from your dataloader, before Trainer:

    batch = next(iter(train_dataloader))
    ids = batch["input_ids"][0]
    labels = batch["labels"][0]
    
    print(tokenizer.decode(ids))
    print(tokenizer.decode(ids[labels != -100]))
    
  • Confirm that:

    • <|think_start|> and <|think_end|> appear in input_ids, and
    • the positions of those IDs in labels are not -100.

If they do not appear in the label region, the model is never trained to generate them; they’ll almost never show up at inference.
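
If it turns out they are being masked, one option is to build the labels yourself so the thinking block is supervised. This is a hedged sketch: the field names, the prompt/response split, and the raw format are assumptions about your data, and `tokenizer` is assumed to already contain the new special tokens:

# Hypothetical example record.
example = {"question": "What is 17 * 24?",
           "cot": "17*24 = 17*20 + 17*4 = 408",
           "answer": "408"}

# Prompt part gets -100 labels; the assistant part, including the thinking
# block, stays supervised.
prompt_text = "user:\n" + example["question"] + "\nassistant:\n"
response_text = ("<|think_start|> " + example["cot"] + " <|think_end|> "
                 + example["answer"] + tokenizer.eos_token)

prompt_ids = tokenizer(prompt_text, add_special_tokens=False)["input_ids"]
response_ids = tokenizer(response_text, add_special_tokens=False)["input_ids"]

input_ids = prompt_ids + response_ids
labels = [-100] * len(prompt_ids) + response_ids   # thinking tokens are trained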


3.2 Chat template mismatch: training vs inference

Qwen-style models are chat models with a specific chat template. For Qwen3, there is even an enable_thinking knob that changes the template to insert <think>...</think>. (Qwen)

If you:

  • train on some custom raw format (maybe user/assistant lines with <|think_start|> ... <|think_end|>),
  • but at inference time, you use tokenizer.apply_chat_template with the default template,

then the actual prompt structure the model sees at inference:

  • may be completely different from what it saw during SFT,
  • may not contain your markers at all,
  • and may have Qwen’s original system/user/assistant tokens in places you never trained on.

This can easily look like “the model started generating trash” because:

  • The prompt distribution has changed,
  • The LoRA is “anchored” on patterns that never appear now,
  • The sampling runs in a region of the model’s behavior that wasn’t tuned.

Concrete checks:

  1. Print the exact strings you trained on after apply_chat_template (or whatever you used), and compare them to what you pass into generate now.

  2. For the CoT runs, make sure the inference prompt contains your "<|think_start|>" and/or gives a very explicit pattern like:

    In your answer, first think step-by-step inside <|think_start|> ... <|think_end|>, then give the final answer.

  3. If you’re using Qwen3’s own template, consider:

    • using <think> / </think> instead of custom markers, and
    • setting enable_thinking=True in apply_chat_template. (Hugging Face)
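
One low-effort way to run check 1 above is to print both strings side by side. The raw training format below is a hypothetical reconstruction of what such a pipeline might feed the trainer, and the model ID is only a guess at the checkpoint:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")  # assumed checkpoint

question = "What is 17 * 24?"

# Hypothetical raw format used during SFT.
train_text = (
    "user:\n" + question + "\n"
    "assistant:\n<|think_start|> 17*24 = 340 + 68 = 408 <|think_end|> 408"
)

# What the model actually sees at inference if you rely on the official template.
infer_text = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)

print("TRAIN >>>", repr(train_text))
print("INFER >>>", repr(infer_text))
# If role markers, separators, or the <|think_start|> placement differ,
# generation runs in a regime the LoRA never saw during training.
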

3.3 The new tokens’ embeddings are effectively random / not tied

You added new tokens and resized:

tokenizer.add_special_tokens({'additional_special_tokens': special_tokens})
model.resize_token_embeddings(len(tokenizer))

By default, resize_token_embeddings will:

    • add rows to the input embedding and output head,
    • initialize them generically (random noise or, in newer transformers versions, the mean of the existing embeddings), so they carry no useful signal until trained.

The point of trainable_token_indices in LoRA is exactly to make those new rows trainable in a targeted way, without touching the rest of the embedding matrix. (Hugging Face)

Possible issues:

  1. PEFT version not supporting trainable_token_indices correctly

    • Older peft versions or forks may ignore this argument or mis-handle it.
    • Verify that peft.__version__ is at least ~0.17–0.18 and that LoraConfig’s signature includes trainable_token_indices. (data.safetycli.com)
  2. Not actually training those embeddings

    • After wrapping with get_peft_model, inspect model and check:

      • the underlying embedding module is wrapped in a TrainableTokensModel,
      • the parameter count in model.print_trainable_parameters() includes the tokens you expect. (Hugging Face)
  3. Untied LM head

    • Some models tie embed_tokens and lm_head weights; some don’t.
    • PEFT docs say trainable tokens try to keep tied weights updated automatically if the model follows Transformers tying conventions. (Hugging Face)
    • If Qwen’s implementation differs, your embeddings might be updated but the output projection for those tokens might remain random.

    Quick test:

    import torch

    w_in = model.get_input_embeddings().weight
    w_out = model.get_output_embeddings().weight
    # Same Parameter object means tied; allclose also catches an untied copy.
    print(w_in is w_out, torch.allclose(w_in, w_out))  # should be True if tied
    

    If they are not tied and you don’t explicitly fine-tune the LM head rows for those token IDs, it becomes much harder for the model to generate those tokens.


3.4 Dataset size and hyperparameters

You’re:

  • training LoRA with new tokens,
  • on ~800 samples,
  • for 1 epoch.

That is extremely small compared to what Qwen uses for thinking mode (they use large math/reasoning datasets and multi-stage pipelines). (Qwen)

Even if:

  • the config is correct,
  • the new tokens are properly trained,
  • the template matches,

you may still see:

  • no visible use of thinking tokens in general prompts, because the model hasn’t seen enough examples to “believe” that emitting them is universally useful;
  • overall quality drop, especially if LR was large or the dataset is noisy, because LoRA can distort behavior even with small data.

You can still experiment with small datasets, but keep expectations realistic:

  • For 800 samples, you may need:

    • a smaller LR (e.g. 1e-5 to 3e-5 for LoRA on a 3B model),
    • more epochs with early stopping,
    • strong regularization,
    • and carefully curated CoT examples.

3.5 Qwen-specific thinking mode interactions

If you are actually using a Qwen3 model that already supports thinking:

  • It may already have <think> and </think> in the tokenizer and the chat template. (Qwen)

  • You then add other special tokens <|think_start|> / <|think_end|>, which:

    • don’t match the built-in template,
    • may be effectively unused by the built-in generation logic,
    • and may also conflict conceptually with the original thinking mode.

In that case, a cleaner approach is:

  • Use Qwen’s existing <think> / </think> tokens and template.
  • Fine-tune on CoT examples formatted exactly the way the docs show. (Qwen)

This avoids having to maintain a parallel “shadow” thinking convention.


4. Concrete debugging steps

If you want a practical checklist, here is a minimal sequence that usually pinpoints the issue.

Step 1 – Verify the tokens exist and are in labels

  1. After adding special tokens and resizing, run:

    ids = tokenizer.convert_tokens_to_ids(["<|think_start|>", "<|think_end|>"])
    print(ids)  # both should be >= 0 and distinct
    
  2. Inspect a batch:

    batch = next(iter(train_dataloader))
    ids = batch["input_ids"][0]
    labels = batch["labels"][0]
    
    print(tokenizer.decode(ids))
    print(tokenizer.decode(ids[labels != -100]))
    
  3. Confirm:

    • <|think_start|> and <|think_end|> appear in input_ids, and
    • they are not masked out (labels at those positions are not -100).

If they’re missing from labels, fix your masking / response_template / collator logic first.


Step 2 – Verify LoRA + trainable tokens are actually active

After wrapping the model with PEFT:

from peft import get_peft_model

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

Check that:

  • trainable parameters is > 0,
  • embedding / trainable tokens are included (if printed in detail).

You can also do a quick gradient check for one batch:

model.train()
batch = next(iter(train_dataloader))
loss = model(input_ids=batch["input_ids"].to(model.device),
             labels=batch["labels"].to(model.device)).loss
loss.backward()

# With trainable_token_indices the delta for the new rows usually lives inside a
# PEFT wrapper, so the frozen base embedding's .grad can legitimately be None.
# Check every trainable embedding/token parameter instead.
for name, p in model.named_parameters():
    if p.requires_grad and ("embed" in name or "token" in name):
        print(name, None if p.grad is None else p.grad.abs().sum().item())

This confirms the embedding rows for <|think_start|> / <|think_end|> are actually receiving gradient updates.


Step 3 – Ensure train and inference templates match

Look at exactly what you pass to the model during training vs inference.

  • If you used apply_chat_template with enable_thinking=False during training but True or different payloads at inference, you are not in the same regime. (docs.unsloth.ai)
  • If you trained on raw “user:\n… assistant:\n<|think_start|>…<|think_end|> answer” strings, do the same at inference as a test, bypassing any chat helper code, to see if the CoT behavior appears.

Step 4 – Compare behavior: base vs adapter

Run the same prompt with:

  1. the base model (LoRA adapter disabled), and
  2. the fine-tuned model (LoRA adapter enabled).
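
A hedged sketch of that A/B comparison, assuming `model` is the PEFT-wrapped model and `tokenizer` is its tokenizer (the prompt and generation settings are placeholders):

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 17 * 24? Think step by step."}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# 1. Adapter disabled: should behave like the original Instruct model.
with model.disable_adapter():
    base_out = model.generate(**inputs, max_new_tokens=256)

# 2. Adapter enabled: your fine-tuned behavior.
tuned_out = model.generate(**inputs, max_new_tokens=256)

print("BASE :", tokenizer.decode(base_out[0], skip_special_tokens=False))
print("TUNED:", tokenizer.decode(tuned_out[0], skip_special_tokens=False))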

If the base is fine and the adapter model is trash, then:

  • Confirm you’re loading the adapter correctly (PeftModel.from_pretrained, not get_peft_model on top of an already-adapted model). (Hugging Face)
  • Confirm you did not accidentally save/reload in a way that dropped the resized embeddings.

5. A more robust recipe for your use case

If you want a safer setup for “thinking tokens with Qwen-3B”:

  1. Use Qwen’s own thinking tokens if available

    • If you’re on a Qwen3 model that already has <think> / </think> and supports enable_thinking, reuse that design. (Qwen)

    • Prepare data in the official format:

      <|im_start|>user
      {question}<|im_end|>
      <|im_start|>assistant
      <think>
      {your chain of thought here}
      </think>
      
      {final answer}<|im_end|>
      
  2. If you must use <|think_start|> / <|think_end|>

    • Add tokens and resize as you did (this part is correct). (Hugging Face)
    • Use trainable_token_indices as you’re doing, but verify your PEFT version and that gradients flow as described above.
    • Make sure your training examples really contain multiple instances of these tokens at label positions.
  3. Use gentle hyperparameters

    • For a 3B model + small dataset, use a low LR (e.g. 1e-5 to 3e-5 for LoRA) as a starting point; Qwen LoRA experiments on Reddit note that small LRs work better and preserve modes (including thinking). (Reddit)
  4. Scale up data if possible

    • 800 examples is small for a qualitative change in behavior like “always emit CoT tokens”; more examples and more variety in reasoning style greatly help.

6. Bottom line

  • Your use of trainable_token_indices and added special tokens is conceptually correct according to PEFT docs (they even use <think> / </think> as the example). (Hugging Face)

  • The fact that the model:

    • never emits your thinking tokens, and
    • begins “generating trash”

    almost certainly means:

    • the tokens were not actually supervised as outputs,
    • or the chat template / inference format does not match training,
    • or the PEFT / version / LM head tying setup prevented the new tokens from being properly learned,
    • and the LoRA update (even with small data) has disturbed the original distribution.

If you systematically:

  • check that <|think_start|> / <|think_end|> are in labels,
  • verify LoRA + trainable tokens are active,
  • align training and inference templates,
  • and, if possible, leverage Qwen’s native <think> format,

you should be able to get CoT markers working without the model collapsing into gibberish.


That seems to be alright. I was checking how DeepSeek distilled their knowledge into Qwen - they were training a base model. I’m starting to get the feeling that it might be more complicated to bake CoT into an Instruct model, and 800 samples might be way too little.


Yeah. When trying to make do with 800 samples, your options are limited. Realistic approaches would be to narrow down the target tasks or the number of tensors you train (via LoRA settings, etc.), or to tweak the learning rate for early convergence. Alternatively, you could simply try fine-tuning a smaller model.


Your feeling is essentially right, with some nuance:

  • Yes: it is harder to “bake in” CoT into an Instruct model than into a base model, especially if you want robust, general behavior with explicit think-tags.
  • Yes: 800 samples is far too little to reliably change a 3B Instruct model’s global behavior around CoT.
  • That does not mean you can’t do anything useful with 800 samples, but it changes what is realistic.

I’ll break this into:

  1. Why DeepSeek used base models + big data (context).
  2. Why Instruct models are trickier for CoT.
  3. What 800 samples can and cannot realistically achieve.
  4. Feasible strategies with only ~800 samples.

1. What DeepSeek actually did (and why your setup is different)

DeepSeek-R1’s pipeline, in short:

  • They start from a strong base model (DeepSeek-V3-Base), not an instruct/chat model. (arXiv)
  • They apply RL directly on that base model to elicit long CoT (R1-Zero), then refine it into DeepSeek-R1. (arXiv)
  • Then they do distillation: R1 generates large amounts of CoT traces, and those are used to train smaller students (Qwen2.5 and Llama checkpoints, including Qwen2.5-32B). (Fireworks AI)
  • Public reporting mentions on the order of hundreds of thousands of distillation examples (e.g. ~800k teacher-generated traces) used to tweak other vendors’ models like Qwen and Llama. (The Financial Times.)

So DeepSeek’s recipe is:

big base model → RL to discover CoT → huge teacher CoT dataset → distill into smaller/base students.

Your experiment is:

small Instruct model (Qwen 3B) → LoRA → 800 CoT examples → new <|think_start|>-style tokens.

That’s several orders of magnitude different in:

  • starting point (base vs instruct),
  • method (RL + distillation vs one small SFT),
  • and data scale (hundreds of thousands vs hundreds).

Given that gap, your feeling that “my 800 samples are nowhere near DeepSeek’s regime” is absolutely correct.


2. Why CoT is trickier to inject into an Instruct model

2.1 Instruct models already have a strong policy and template

Qwen-style Instruct / chat models already:

  • use a specific chat template (role tokens, EOS behavior),
  • are tuned to answer directly and be concise,
  • and in Qwen3’s case they may already support a built-in “thinking mode” with <think>...</think> and enable_thinking or /think / /no_think controls. (Qwen)

That means:

  • The model has a strong prior: “user asks → assistant answers cleanly.”
  • You are trying to impose a new protocol: “assistant opens <|think_start|>, reasons, closes <|think_end|>, then answers.”

With only a tiny number of examples, that new protocol is a weak signal fighting against a strong, already-learned behavior and a fixed template.

By contrast, DeepSeek’s base model did not have explicit chat/CoT behavior baked in; RL and distillation define that behavior from scratch. (arXiv)

2.2 New tokens + new behavior is a double challenge

You’re doing two things at once:

  1. Introducing brand-new tokens (<|think_start|>, <|think_end|>).
  2. Asking the model to change policy to use them consistently before answers.

Training brand-new tokens to be high-probability outputs requires:

  • that they appear many times in labels (not just inputs), and
  • that the surrounding pattern is consistently rewarded.

With only 800 samples, those tokens are extremely rare compared to the rest of the vocab, and the model’s prior “answer without them” remains dominant.

Research on CoT distillation (including R1 distillation and follow-up work) assumes large CoT corpora to make these patterns stick. (arXiv)

So again, your intuition that this is harder on an Instruct model and underpowered with 800 samples is correct.


3. What 800 samples can and cannot realistically do

Given the scale and starting point, here’s a realistic view.

3.1 What 800 CoT examples cannot reliably do

With a 3B Instruct model and a standard LoRA setup, 800 examples is not enough to:

  • robustly “bake in” a new global CoT behavior with explicit <|think_*|> tokens across tasks;
  • override Qwen’s existing chat template and answer style;
  • recreate anything close to DeepSeek-R1-style reasoning enhancement on diverse benchmarks.

Empirical SFT studies on 3–7B models typically use tens of thousands to hundreds of thousands of examples even for ordinary instruction tuning. (arXiv)

DeepSeek’s and related distillation efforts (Open-R1, curriculum distillation, etc.) operate at similar or larger scales for CoT. (arXiv)

So expecting 800 samples to globally reprogram an Instruct model’s reasoning style is not realistic.

3.2 What 800 CoT examples can be useful for

They can be useful if you narrow your ambition:

  1. Pipeline sanity check

    • Use them to verify that:

      • your tokenizer changes work,
      • the new tokens show up in labels,
      • LoRA is wired correctly and gradients flow.
    • In other words, treat this run as debugging/validation, not production behavior change.

  2. Narrow, domain-specific behavior

    • If your 800 examples are all from one domain (say, arithmetic word problems or a specific coding style), you might be able to:

      • slightly improve behavior on prompts that look very similar to the training set,
      • especially if you always explicitly ask the model to use <|think_start|> ... <|think_end|> for those tasks.
    • This would be local behavior change, not a global CoT upgrade.

  3. Evaluation / validation set

    • They are quite valuable as a small, hand-curated eval set:

      • to compare different prompts (plain vs CoT prompts),
      • to compare base vs reasoning models vs your future distilled models,
      • or to drive meta-learning approaches (like sample reweighting techniques such as META-LoRA, which explicitly assume a small, high-quality validation set). (ACL Anthology)
  4. Seed data for scaling up via a teacher

    • You can use a strong teacher (DeepSeek-R1, DeepSeek-R1-Distill-Qwen, Qwen3-Thinking, etc.) to generate more CoT examples in the same style as your 800 seeds. (Hugging Face)
    • The 800 samples define the “look and feel”; the teacher then produces thousands more.
    • Then you can do a more serious SFT or distillation run.
  5. Prompt / scaffolding design

    • Following “make any model reasoning” style approaches, you can:

      • design prompts that show one or two of your 800 CoT examples as in-context demonstrations,
      • use them to test how far you can get with prompting alone (no SFT). (Hugging Face)

In all of these uses, 800 samples are useful—but as tools (for eval, seed, or local tweaks), not as the main fuel for global CoT training.


4. Feasible way forward with only these samples

Given your constraints, here is a feasible and realistic plan.

4.1 Treat your 800 examples as “gold” evaluation and seed data

  • Keep them mostly as held-out eval or teacher seed rather than training everything on them.

  • Use them to:

    • measure how well different models or prompts reproduce your desired reasoning style,
    • benchmark DeepSeek-R1-Distill-Qwen or Qwen3-Thinking models against your current Qwen-3B-Instruct. (Hugging Face)

4.2 If you still want to fine-tune on them

You can still do a very light LoRA SFT, but with limited expectations:

  • Objective: small, domain-specific improvement; not global CoT.

  • Conditions:

    • keep LR low (e.g. 1e-5 to 3e-5),
    • use a very small number of steps and early stopping,
    • monitor that general response quality on non-CoT tasks does not collapse.
  • Use your CoT examples only for the domain where you will actually apply them (e.g. specific math tasks).

Think of this as “adding a trick” to the model, not redesigning its reasoning brain.
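
As a rough illustration of “very light”, here is a hedged LoRA + training-arguments sketch under those conditions; all values are starting points to tune, not recommendations from the thread, and the output directory is a placeholder:

from peft import LoraConfig
from transformers import TrainingArguments

peft_config = LoraConfig(
    r=8,                                   # smaller rank: fewer weights to disturb
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # narrower than full attention + MLP
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="cot-lite",
    learning_rate=2e-5,                    # low LR, in the 1e-5 to 3e-5 range above
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_ratio=0.1,
    logging_steps=10,                      # watch loss closely; stop early if quality drops
    save_steps=50,
)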

4.3 Get explicit CoT by switching models + prompt, not by retraining

If explicit CoT is the main goal (more than “I must use these exact tokens”):

  • Use a model that already has thinking mode baked in, e.g. Qwen3-Thinking-4B or DeepSeek-R1-Distill-Qwen. (Hugging Face)

  • Then:

    • use your 800 samples as eval and a small SFT for style or domain, if needed;
    • rely on the base reasoning behavior already present, rather than trying to create it from scratch.

This is exactly the “feasible with small data” path: start from a reasoning-oriented model and do small domain SFT, instead of starting from an Instruct model and trying to create reasoning behavior.

4.4 Use the 800 samples as a starting point to synthesize more data

If you eventually want your own CoT protocol (e.g. <|think_start|> ... <|think_end|> and a specific style), the realistic route is:

  1. Use your 800 examples as style anchors.
  2. Use a stronger teacher (R1, R1-distill, Qwen3-Thinking) to generate thousands of new CoT examples that imitate this format. (Fireworks AI)
  3. Fine-tune a base or reasoning model on this larger synthetic dataset.

Your 800 examples then become the “taste” you want, not the entire diet.
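
A hedged sketch of step 2, assuming you can run a distilled reasoning teacher locally; the model ID, sampling settings, and output format are all placeholders, not part of the thread:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, json

teacher_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"   # example teacher
tok = AutoTokenizer.from_pretrained(teacher_id)
teacher = AutoModelForCausalLM.from_pretrained(
    teacher_id, torch_dtype=torch.bfloat16, device_map="auto"
)

seed_questions = ["What is 17 * 24?"]   # in practice: questions from / like your 800 seeds

synthetic = []
for q in seed_questions:
    prompt = tok.apply_chat_template(
        [{"role": "user", "content": q}], tokenize=False, add_generation_prompt=True
    )
    inputs = tok(prompt, return_tensors="pt").to(teacher.device)
    out = teacher.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.6)
    trace = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)
    synthetic.append({"question": q, "teacher_trace": trace})

with open("synthetic_cot.jsonl", "w") as f:
    for row in synthetic:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

# You would then map the teacher's <think> ... </think> spans onto your own
# <|think_start|> / <|think_end|> convention before running SFT.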


5. Direct answer to your question

“Is this feeling correct, and if so, what is a feasible way forward with only these samples?”

  • Your feeling is correct in the practical sense:

    • Injecting CoT into an Instruct model is more complicated than doing it on a base model, because you’re fighting an existing chat/answer prior and template—especially for Qwen3, which already has its own thinking mechanism. (Qwen)
    • 800 samples is far too little to robustly “bake in” CoT as a general behavior via SFT/LoRA, compared to the large-scale distillation and RL pipelines used by DeepSeek and similar projects. (arXiv)
  • With only these 800 samples, a feasible way is:

    • do not rely on them to fully reprogram the model’s reasoning;

    • instead, use them as:

      • a small evaluation/validation set,
      • seeds for teacher-generated synthetic CoT data,
      • and possibly a small, careful LoRA SFT for narrow domain/style tuning,
    • while relying on a model that already has good reasoning (Qwen3-Thinking, DeepSeek-R1-Distill-Qwen, etc.) for the heavy lifting. (Hugging Face)

That way, you respect what small data can and cannot do—and you align your setup much more closely with what DeepSeek-style distillation pipelines actually look like.