How to understand the special tokens?

Whether increasing or decreasing the number of tokens, I often see resize_token_embeddings.


You are basically right: if you just “add tokens + resize embeddings” and don’t explicitly make those new rows trainable and well-represented in the data, their embeddings can stay essentially untrained and hurt performance.

Let me break down both parts of your question:

  1. “Is there an interface which can reduce token?” (shrink back / remove tokens)
  2. “Maybe I didn’t fine-tune the new token’s embedding” (how to fix that, especially with Unsloth + vLLM)

1. Can you “reduce tokens” after adding them?

1.1 What the official API supports

On the model side, Transformers exposes exactly one official hook for changing vocab size:

model.resize_token_embeddings(new_num_tokens)

This works for:

  • Increasing vocab (add rows to embedding + lm_head).
  • Decreasing vocab (drop rows from the end). (Hugging Face)

On the tokenizer side, there is no nice high-level “remove these tokens and reindex everything” API. You can:

  • Add tokens (tokenizer.add_tokens, tokenizer.add_special_tokens). (Hugging Face)
  • Manually edit tokenizer.vocab, tokenizer.encoder, added_tokens_encoder (for some tokenizers) if you really want to hack things, but this is not supported and easy to break. A long-standing GitHub issue on “How to remove token?” is explicitly marked “wontfix”; the suggested workaround is manual dictionary surgery. (GitHub)

So:

  • There is a way to change the size of the embedding matrix (resize_token_embeddings).
  • There is no safe, general interface to “remove arbitrary tokens from the middle of the vocab and renumber everything”.
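
If you want to see concretely what you would be removing, a quick diagnostic looks like this (the path is a placeholder; it just inspects your current fine-tuned checkpoint):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/your-finetuned-model")
model = AutoModelForCausalLM.from_pretrained("path/to/your-finetuned-model")

print(tokenizer.get_added_vocab())                 # added tokens and their IDs
print(len(tokenizer))                              # tokenizer-side vocab size
print(model.get_input_embeddings().weight.shape)   # embedding rows on the model side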

1.2 What happens if you shrink vocab size?

If you call:

# Suppose tokenizer has grown to 151670 tokens:
model.resize_token_embeddings(new_num_tokens=151640)

then:

  • The model’s input embeddings and output lm_head are re-created with the first 151640 rows.
  • Whatever was in the last rows is simply dropped.

This assumes:

  • The tokens you want to remove are exactly the last ones (which is usually true if you only ever appended new tokens).
  • The tokenizer is updated consistently so that its vocab size is also 151640 and it never emits IDs ≥ 151640.
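
A quick way to check the second assumption before shrinking (a minimal sketch reusing the example sizes above; the sample string is a placeholder):

new_size = 151640

assert len(tokenizer) == new_size, f"tokenizer still has {len(tokenizer)} tokens"

sample = "text that used to contain your custom tokens, e.g. <think> ... </think>"
ids = tokenizer(sample)["input_ids"]
assert max(ids) < new_size, "tokenizer still emits IDs the shrunken model cannot embed"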

Even then, shrinking has some risk:

  • There are bug reports where shrinking embeddings caused shape mismatches or CUDA crashes for some architectures (T5, etc.). (GitHub)
  • If you saved checkpoints at the larger size and then load them with a smaller-size model, you must handle the mismatch manually.

There is also a community recipe that shows how to manually remove unwanted tokens and corresponding rows from the embeddings to reduce model size, but it’s advanced, easy to get wrong, and specific to a given model/tokenizer. (Medium)

1.3 Practical answer for your use case

Given that you:

  • Added a small number of tokens (<think>, <tool_response>, etc.).
  • Are not happy with the fine-tuned result.

The simplest safe options are:

  1. Revert to the original base model + original tokenizer and re-finetune without those tokens (if you decide you don’t actually need them as separate IDs).

  2. Or keep the extra tokens, but fix the embedding training, instead of trying to shrink the vocab:

    • That avoids all the brittle “remove tokens” surgery.
    • It’s what most people do when customizing control tokens.

If you really want to “undo” the addition:

  • Restore the original tokenizer files (tokenizer.json, tokenizer_config.json, special_tokens_map.json) from the base model.
  • Restore the original base model weights (or re-download them).
  • Then start a new fine-tune run from that clean base.
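
In code, “restore the original” is usually just re-downloading the base artifacts (the repo name is a placeholder for your actual base model):

from transformers import AutoModelForCausalLM, AutoTokenizer

base = "your-org/your-base-model"                   # placeholder

tokenizer = AutoTokenizer.from_pretrained(base)     # original tokenizer.json, tokenizer_config.json, ...
model = AutoModelForCausalLM.from_pretrained(base)  # original weights

tokenizer.save_pretrained("clean-base")
model.save_pretrained("clean-base")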

Trying to surgically remove tokens from your current checkpoint is possible but not worth it unless you’re very comfortable doing custom model surgery.


2. Why your new tokens may not be trained (and how to fix it)

Your suspicion is very plausible:

“maybe I did not finetune the new token’s embedding.”

Especially in a LoRA / PEFT / Unsloth setup, this happens a lot.

2.1 How new token embeddings are initialized

Standard Hugging Face behavior:

  1. You add tokens to the tokenizer:

    num_new = tokenizer.add_tokens(["<think>", "</think>", "<tool_response>", "</tool_response>"])
    
  2. You resize embeddings:

    model.resize_token_embeddings(len(tokenizer))
    
  3. Transformers creates a bigger embedding matrix:

    • Old rows are copied over.
    • New rows are initialized randomly (often from a normal distribution). (Hugging Face)

HF forum discussions and docs explicitly say: to make these useful, you must fine-tune so that the new rows get gradients; otherwise they stay near random. (Hugging Face Forums)

Unsloth adds a more advanced initializer: in their resizing code, new embeddings are initialized from a multivariate normal with the mean and covariance of the old embeddings (vocabulary expansion trick from Hewitt’s paper). (GitHub)
That’s better than pure random, but it still needs training to become meaningful.
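
Conceptually, that kind of initialization looks roughly like this (an illustrative sketch, not Unsloth’s actual code; the ridge term and the assumption that the new rows sit at the end of the matrix are mine):

import torch

emb = model.get_input_embeddings().weight.data            # shape: (vocab_size, hidden_size)
num_new = 4                                               # e.g. the four tokens added above
old = emb[:-num_new].float()

mean = old.mean(dim=0)
# Covariance of the existing embeddings, plus a small ridge term so sampling stays stable
cov = torch.cov(old.T) + 1e-5 * torch.eye(old.shape[1], device=old.device)

dist = torch.distributions.MultivariateNormal(mean, covariance_matrix=cov)
emb[-num_new:] = dist.sample((num_new,)).to(emb.dtype)

When input and output embeddings are not tied, the lm_head rows usually get the same treatment. Either way, this is only a starting point; the rows still need gradient updates to become meaningful.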

2.2 Why fine-tuning may not be updating the new embeddings

There are two main failure modes:

(A) The data doesn’t contain the new tokens often enough

  • Gradients for a token embedding are only generated when that token ID appears in the input / labels.
  • If <think> appears just a handful of times, its embedding hardly moves.
  • HF forum discussions about “training embeddings of tokens” emphasize that you need enough occurrences, otherwise the embeddings stay poor. (Hugging Face Forums)

You can fix this by:

  • Making sure your training prompts actually contain <think> / <tool_response> many times.
  • Possibly oversampling examples that use them.
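
A rough way to measure this, reusing the tokenizer from above (train_texts is a placeholder for your raw training strings):

from collections import Counter

new_tokens = ["<think>", "</think>", "<tool_response>", "</tool_response>"]
new_ids = set(tokenizer.convert_tokens_to_ids(new_tokens))

train_texts = ["<think>Check the tool output.</think> ..."]   # replace with your real training strings
counts = Counter()
for text in train_texts:
    for tid in tokenizer(text)["input_ids"]:
        if tid in new_ids:
            counts[tokenizer.convert_ids_to_tokens(tid)] += 1

print(counts)   # if these numbers are tiny, the new embeddings get almost no gradient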

(B) You are using LoRA / PEFT and embeddings are frozen

With PEFT / LoRA (which Unsloth uses under the hood):

  • By default, the base model weights (including input embeddings) are frozen.
  • Only adapter parameters (low-rank matrices) are trainable.

If you add new tokens:

  • The new embedding rows live in the base embedding matrix (not in the LoRA adapters).
  • If the embedding module is frozen, those rows never get updated — they stay in their initial distribution.

There is a direct StackOverflow + PEFT answer about this exact point: if you resize token embeddings and only do LoRA, you need to mark embeddings as trainable via modules_to_save=["embed_tokens"] in LoraConfig, otherwise the base embeddings (including new tokens) remain untrained. (Stack Overflow)

With Unsloth specifically:

  • You typically load the model via FastLanguageModel.from_pretrained(...). (Medium)
  • If you add tokens and call model.resize_token_embeddings(len(tokenizer)), the new rows are created in the base embeddings.
  • If your LoRA config doesn’t include embed_tokens (or similar) as a module to save/train, those new rows will not be updated.

So yes: your suspicion is likely correct — in a typical Unsloth LoRA setup, new tokens do not get useful embeddings unless you explicitly configure them to be trainable.
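
You can verify this directly after setting up LoRA with a check like the following (embed_tokens / lm_head are the usual module names for Llama/Qwen-style models; adjust for your architecture):

for name, param in model.named_parameters():
    if "embed_tokens" in name or "lm_head" in name:
        print(name, param.requires_grad, tuple(param.shape))

If every embedding parameter prints requires_grad=False, the new rows are frozen and will keep their initial values no matter how long you train.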

2.3 How to train the new embeddings properly

The general recipe (HF + PEFT):

  1. Add tokens and resize embeddings before creating / patching the PEFT model.

  2. In your LoRA / PEFT config, include embeddings as modules to save/train, for example:

    from peft import LoraConfig
    
    lora_config = LoraConfig(
        r=16,                      # example value; keep whatever rank you already use
        lora_alpha=16,             # example value
        lora_dropout=0.0,          # example value
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        modules_to_save=["embed_tokens"],   # <-- important: trains and saves the full embedding matrix
    )
    

    This makes the embedding layer trainable and ensures updated embeddings are saved in your PEFT checkpoint. (Stack Overflow)

  3. Make sure your training data actually uses <think> / <tool_response> plenty of times.

  4. Train for enough steps so those embeddings converge.
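
Putting the four steps together, a minimal end-to-end sketch with plain Transformers + PEFT (the model name and hyperparameters are placeholders; with Unsloth the same ideas apply around FastLanguageModel):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "your-org/your-base-model"                     # placeholder
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# 1) Add tokens and resize BEFORE wrapping with PEFT
tokenizer.add_tokens(["<think>", "</think>", "<tool_response>", "</tool_response>"])
model.resize_token_embeddings(len(tokenizer))

# 2) Make the resized embedding layer trainable alongside the LoRA adapters
lora_config = LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0.0,            # example values
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["embed_tokens"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()                    # embed_tokens should show up as trainable

# 3)–4) Train with a dataset that uses the new tokens heavily (Trainer / SFTTrainer / etc.)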

Unsloth has helpers like add_new_tokens(model, tokenizer, new_tokens=...) that integrate resizing and initialization; you still need to make sure your LoRA config allows the embeddings to update. (GitHub)


3. How vLLM fits into this

You mentioned:

“I use vllm to inference after finetuning, I don’t know which it use, either”

vLLM’s documentation is clear:

  • Tokenizer: vLLM simply uses the Hugging Face tokenizer loaded with AutoTokenizer. (docs.vllm.ai)
  • Model weights: it loads standard HF model checkpoints (PyTorch / safetensors).

So:

  • Whatever tokenizer + vocab + embeddings you used during Unsloth fine-tuning must be the same ones you ship to vLLM.
  • There is no separate “vLLM-only interface” to reduce tokens; vLLM just respects whatever vocab and token IDs exist in the Hugging Face artifacts you give it.

If the new tokens have bad embeddings, vLLM will produce bad generations involving them, exactly like Transformers would.
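
For example, loading the exported directory with vLLM’s offline API looks roughly like this (the path is a placeholder):

from vllm import LLM, SamplingParams

llm = LLM(model="path/to/exported-model")   # directory must contain both the weights and the tokenizer files
outputs = llm.generate(
    ["<think>"],                            # quick way to see how the new token behaves at inference time
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)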


4. What I would do in your situation

Given your notes, here is a concrete plan:

Option 1 – Keep the new tokens, fix training (recommended if you need them)

  1. Start from the original base model again (or from your current checkpoint, but the cleanest option is the base).
  2. Add your custom tokens (<think>, <tool_response>, etc.) using tokenizer.add_tokens (not add_special_tokens) and model.resize_token_embeddings(len(tokenizer)).
  3. In Unsloth / PEFT, configure LoRA with modules_to_save=["embed_tokens"] (and optionally include embed_tokens in target_modules if needed). (Stack Overflow)
  4. Make sure your dataset uses the new tokens very frequently.
  5. Fine-tune again.
  6. Export model + tokenizer together, and point vLLM at that directory.
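
For step 6, one way to get a single directory that vLLM can load is to merge the LoRA weights back into the base model with plain PEFT (paths and the base model name are placeholders; Unsloth also ships its own merged-export helpers):

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/lora-checkpoint")      # tokenizer saved with the run
base = AutoModelForCausalLM.from_pretrained("your-org/your-base-model")   # placeholder base
base.resize_token_embeddings(len(tokenizer))                              # match the training-time vocab first

merged = PeftModel.from_pretrained(base, "path/to/lora-checkpoint").merge_and_unload()

merged.save_pretrained("export-for-vllm")
tokenizer.save_pretrained("export-for-vllm")                              # ship tokenizer + weights together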

Option 2 – Decide you don’t need the new tokens and revert

  1. Restore the original tokenizer (tokenizer.json, tokenizer_config.json, special_tokens_map.json) from the base model. (stephendiehl.com)
  2. Restore the original base checkpoint.
  3. Treat <think> and <tool_response> as multi-token strings instead of new IDs (just literal text in the prompt), accepting slightly longer sequences.
  4. Fine-tune without changing vocab.

This avoids all embedding and resizing complexity.

Option 3 – Advanced: actually shrink embeddings

Only if you are comfortable doing low-level surgery:

  • Confirm that the tokens you added are exactly at the end of the vocab.
  • Adjust tokenizer so those tokens are no longer present (or rebuild a tokenizer without them).
  • Call model.resize_token_embeddings(new_smaller_vocab_size) to drop the trailing rows.
  • Ensure lm_head and any tied embeddings are shrunk consistently. (Medium)

This is doable but fragile, and usually not necessary unless you are chasing model size reductions for deployment.
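
If you do go down this road, the core of the surgery is roughly this (a sketch, assuming you still have the original base tokenizer, that your custom tokens occupy exactly the last IDs, and that you test everything on a copy first):

from transformers import AutoTokenizer

base_tokenizer = AutoTokenizer.from_pretrained("your-org/your-base-model")   # original vocab (placeholder)

new_tokens = ["<think>", "</think>", "<tool_response>", "</tool_response>"]
ids = tokenizer.convert_tokens_to_ids(new_tokens)
# Only valid if the added tokens are exactly the trailing IDs of the current vocab
assert sorted(ids) == list(range(len(base_tokenizer), len(tokenizer)))

model.resize_token_embeddings(len(base_tokenizer))   # drops the trailing rows (and a tied lm_head with them)
base_tokenizer.save_pretrained("shrunk-model")
model.save_pretrained("shrunk-model")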


Summary

  • There is no clean, general “remove token” API. You only have model.resize_token_embeddings(new_num_tokens) and manual tokenizer surgery; shrinking is possible but brittle and only removes rows at the end. (Hugging Face)

  • Your poor fine-tuning results are very likely due to the new token embeddings never really being trained:

    • New rows are random/mean-initialized.
    • LoRA / PEFT often keeps embeddings frozen unless you explicitly mark them as trainable (modules_to_save=["embed_tokens"]).
    • If the new tokens don’t appear frequently in the training data, their embeddings get almost no gradient. (Hugging Face Forums)
  • vLLM simply uses the same Hugging Face tokenizer and model you trained, so any bad embeddings will show up in its outputs too. (docs.vllm.ai)
