:<|im_end|>\n<|im_start|>assistant generated despite using the predict model

Your predict model behaves like a chat model because the training pipeline quietly turned your {prompt, completion} data into chat-formatted text with <|im_start|> / <|im_end|>, and you’re now decoding that raw structure at inference.

So the model is doing what it was trained to do. The surprising part is what TRL + the tokenizer did under the hood, not the weights themselves.

I’ll walk through:

  • What TxGemma “predict” actually is
  • What SFTTrainer does with {prompt, completion}
  • Why you see :<|im_end|>\n<|im_start|>assistant
  • How to fix decoding, or how to reconfigure training if you really want a pure completion model

1. TxGemma “predict” vs “chat”: what that split really means

From Google’s docs and model card, TxGemma is a family of models (predict and chat variants) built on Gemma 2, aimed at therapeutic tasks. (Google Developers Blog)

Important points:

  • “Predict” variants (like txgemma-2b-predict) are optimized for prediction tasks and typically exposed as “base-style” models.
  • “Chat” variants add extra instruction-tuning data to support multi-turn conversation and explanations, at a small cost in raw predictive performance. (Google Developers Blog)

However:

  • Predict vs chat is not enforced at the tokenizer/vocabulary level.
  • The tokenizer for Gemma-family models still usually ships a chat template with role markers and the matching special tokens; in your run those markers are <|im_start|> and <|im_end|>, exactly what shows up in your output (google/gemma-2-2b-it, for comparison, likewise exposes its own turn-based chat template). (Hugging Face)

So even a “predict” checkpoint can:

  • Understand those special tokens, and
  • Learn to use them if your fine-tuning data includes them.

The key question is: who put those tokens into your data? Usually it isn't you; it's TRL's SFTTrainer plus the tokenizer.
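A quick way to check that the predict tokenizer really carries these markers and a usable template (a small sketch; the token strings are taken from your log):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("google/txgemma-2b-predict")
    # Ids different from the unk id mean the vocabulary knows the markers;
    # a non-None chat_template means SFTTrainer has something to apply.
    print(tok.convert_tokens_to_ids(["<|im_start|>", "<|im_end|>"]), tok.unk_token_id)
    print(tok.chat_template is not None)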


2. What TRL’s SFTTrainer actually does with {prompt, completion}

You’re using:

  • Model: TxGemma-2b predict
  • Data: {"prompt": ..., "completion": ...}
  • SFTConfig: mostly defaults

This matches what TRL calls a prompt–completion dataset type. The TRL docs say:

SFT supports both language modeling and prompt-completion datasets. The SFTTrainer is compatible with both standard and conversational dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset. (Hugging Face)

There is also an explicit statement in an earlier TRL issue about instruction-style {prompt, completion} data:

The SFTTrainer will then format the dataset for you using the defined format from the model’s tokenizer with the apply_chat_template method. (GitHub)

Putting this together:

  1. TRL recognizes your dataset as prompt–completion.

  2. It converts each pair into an internal “conversation” like:

    • user: prompt
    • assistant: completion
  3. Then, if the tokenizer has a chat_template, it runs:

    tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=<something>
    )
    
  4. With the ChatML-style template in effect here, that expands to a string with markers such as:

    <|im_start|>user
    <PROMPT><|im_end|>
    <|im_start|>assistant
    <COMPLETION><|im_end|>
    

So even though your original dataset was “clean completion data”, the preprocessed text the model actually sees in training is chat-style with <|im_start|> / <|im_end|>.

This behavior (auto ChatML / chat-template formatting) has been discussed multiple times:

  • “SFTTrainer: Why do we always switch to chatML?” where a user noticed that _prepare_dataset keeps trying to convert data into chat format via maybe_convert_to_chatml and maybe_apply_chat_template. (GitHub)
  • HF forum thread “SFT Trainer and chat templates”, asking explicitly whether SFTTrainer automatically applies the tokenizer’s chat_template for standard formats, and the answer is effectively “yes, if the template exists.” (Hugging Face Forums)

So even if you never mention <|im_start|> / <|im_end|> in your code, SFTTrainer + tokenizer may inject them in your training text.
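You can reproduce that preprocessing on a single example yourself (a sketch; the exact string depends on the template the tokenizer ships):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("google/txgemma-2b-predict")

    # One {prompt, completion} pair, rewritten as the user/assistant conversation
    # TRL builds internally, then rendered with the tokenizer's chat template.
    messages = [
        {"role": "user", "content": "example prompt"},
        {"role": "assistant", "content": "example completion"},
    ]
    if tok.chat_template is not None:
        print(tok.apply_chat_template(messages, tokenize=False))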


3. How that training setup explains your inference artifacts

During training, given the above behavior, the model sees something like:

<|im_start|>user
<prompt><|im_end|>
<|im_start|>assistant
<completion><|im_end|>

plus whatever padding/EOS logic SFTTrainer adds. In particular:

  • The input context always contains user and assistant segments wrapped by these tags.
  • With prompt–completion type, SFTTrainer usually uses completion-only loss, so the loss is computed only on the assistant’s tokens, not the prompt’s, but the model still conditions on the full tagged context. (Hugging Face)
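As a toy illustration of that masking (the token ids below are invented, not real TxGemma ids): the prompt tokens stay in input_ids, but their labels are set to -100, the value the Hugging Face loss ignores.

    prompt_ids = [101, 2054, 2003]     # made-up ids for the templated prompt (user turn + assistant header)
    completion_ids = [2172, 7]         # made-up ids for the completion
    input_ids = prompt_ids + completion_ids
    labels = [-100] * len(prompt_ids) + completion_ids   # loss only on the completion
    print(input_ids)
    print(labels)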

Two effects at inference time:

3.1 Prompt format mismatch

At inference you likely do something like:

inputs = tokenizer(user_text, return_tensors="pt")                 # bare text, no chat markers
output = model.generate(**inputs, max_new_tokens=...)
decoded = tokenizer.decode(output[0], skip_special_tokens=False)   # special tokens left in the string

So the model receives bare text, not:

<|im_start|>user
user_text<|im_end|>
<|im_start|>assistant

But during SFT it always saw that structure. It learned:

  • After <|im_start|>user ... <|im_end|> comes <|im_start|>assistant.
  • After the assistant content ends, it often emits <|im_end|>.

When you now ask it to generate from a raw prompt, it will often:

  • First “repair” the prompt into the expected chat structure by generating something that includes <|im_end|> and <|im_start|>assistant.
  • Then start your actual answer.

This is why you see sequences like:

:<|im_end|>\n<|im_start|>assistant ...

Those markers are just the structural boundary it learned.

3.2 Raw decoding of special tokens

By default, tokenizer.decode(..., skip_special_tokens=False) will leave special tokens in the output string. For chat/instruction models, standard practice is to:

  • Either decode with skip_special_tokens=True, or
  • Manually split at a sentinel like <|im_end|> and drop everything afterward.

This pattern appears in multiple chat examples and in discussions about EOS handling, including Qwen2.5 issues where double <|im_end|> + newline appear because of template + EOS logic. (GitHub)

So the presence of these raw tags in your log is not proof the model is a “chat” checkpoint; it’s just proof you’re looking at the unfiltered chat-formatted text it was fine-tuned on.
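A third option, often the cleanest, is to decode only the tokens that generate() appended after your prompt; a minimal sketch using the variable names from the snippet in 3.1:

    # Slice off the prompt so neither the prompt text nor leading chat markers appear.
    prompt_len = inputs["input_ids"].shape[1]
    answer = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
    print(answer)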


4. Why the “predict” vs “chat” label does not prevent this

The TxGemma pages and model card treat “predict” and “chat” as different variants in the suite. (Google DeepMind)

But TRL doesn’t have any special casing like:

  • “If model is predict, do not apply chat_template.”

Instead, its main triggers are:

  • Does the tokenizer have a chat_template?
  • Is the dataset in a standard instruction format (messages, prompt/completion)?

If yes, it will try to build conversations and call apply_chat_template. (Hugging Face)
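You can check both triggers for your own setup (assuming tok is your tokenizer and ds your training dataset; this is a rough reconstruction of the conditions, not TRL's actual code):

    has_template = tok.chat_template is not None
    looks_prompt_completion = {"prompt", "completion"} <= set(ds.column_names)
    print(has_template, looks_prompt_completion)   # True, True ⇒ chat formatting kicks in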

So from TRL’s perspective:

  • TxGemma-2b-predict + {prompt, completion} + chat_template present
    ⇒ “Great, this is a chat training setup.”

Your expectation:

  • “Predict model + completion dataset + default SFTConfig ⇒ pure completion training”

does not match current design. That mismatch is what you’re experiencing.


5. What you can do now

5.1 If you are okay treating this as a chat model

This is the lowest-friction route.

  1. Use the chat template at inference.

    Instead of feeding bare text, construct messages and apply the template:

    from transformers import AutoTokenizer, AutoModelForCausalLM

    # If your fine-tuning run saved a tokenizer next to the model, load it from that
    # directory instead, so any template/token changes made during training come along.
    tok = AutoTokenizer.from_pretrained("google/txgemma-2b-predict")
    model = AutoModelForCausalLM.from_pretrained("path/to/your/finetuned/model")

    # Wrap the user text as a chat turn and let the template append the assistant
    # prefix, mirroring the structure the model saw during SFT.
    messages = [{"role": "user", "content": user_prompt}]
    prompt_text = tok.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    inputs = tok(prompt_text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    raw = tok.decode(out[0], skip_special_tokens=False)
    
  2. Strip at <|im_end|> or use skip_special_tokens=True.

    For example:

    # generate() returns prompt + continuation, so isolate the assistant turn first,
    # then cut at its closing <|im_end|>:
    answer_part = raw.split("<|im_start|>assistant")[-1]
    answer = answer_part.split("<|im_end|>")[0].strip()
    

    or simply:

    # note: out[0] still contains the prompt tokens, so the prompt text is kept as well
    answer = tok.decode(out[0], skip_special_tokens=True)
    

As long as you align inference format with training format, the weird :<|im_end|>\n<|im_start|>assistant fragments disappear from the user-facing text.

5.2 If you want a true “completion-only” predict-style model

Then you need to prevent the automatic chat formatting.

Two main approaches:

  1. Disable chat_template before building SFTTrainer.

    Load the tokenizer and wipe its template so maybe_apply_chat_template has nothing to use:

    tok = AutoTokenizer.from_pretrained("google/txgemma-2b-predict")
    tok.chat_template = None                      # nothing for maybe_apply_chat_template to use
    if "chat_template" in getattr(tok, "init_kwargs", {}):
        tok.init_kwargs.pop("chat_template")      # keep it from being re-saved with the tokenizer
    

    Then pass this tokenizer into SFTTrainer. With no chat template defined, SFTTrainer will not wrap your examples in <|im_start|> / <|im_end|> and will instead use a simpler prompt + completion + EOS scheme for prompt–completion. (GitHub)
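    A minimal sketch of wiring that in (argument names vary across TRL versions; recent releases accept processing_class, older ones use tokenizer):

    from trl import SFTConfig, SFTTrainer

    trainer = SFTTrainer(
        model="google/txgemma-2b-predict",
        args=SFTConfig(output_dir="txgemma-2b-completion-sft"),   # output_dir is just an example name
        train_dataset=ds,          # your {"prompt": ..., "completion": ...} dataset
        processing_class=tok,      # the tokenizer with chat_template cleared above
    )
    trainer.train()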

  2. Skip SFTTrainer’s automatic formatting entirely.

    • Build a "text" field yourself:

      def build_text(example):
          # plain concatenation; appending EOS teaches the model where to stop
          return {"text": example["prompt"] + example["completion"] + tok.eos_token}

      ds = ds.map(build_text)
      
    • Either:

      • use SFTTrainer with dataset_text_field="text" (set on SFTConfig in recent TRL versions) and no chat_template, or
      • fall back to plain Trainer plus a DataCollatorForCompletionOnlyLM if you want explicit control of loss masking (a minimal sketch follows this list). (Hugging Face)
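    For the collator route, a hedged sketch (the response_template must be a literal marker you insert between prompt and completion in your "text" field; "### Answer:" below is only an example choice):

      from trl import DataCollatorForCompletionOnlyLM

      # Everything before the marker is masked out of the loss, so only the completion is learned.
      collator = DataCollatorForCompletionOnlyLM(response_template="### Answer:", tokenizer=tok)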

In either case, you will need to retrain from the base TxGemma-2b-predict checkpoint, because the current fine-tuned weights have already learned that chat structure.


6. Sanity checks you can run on your side

To be fully convinced:

  1. Print the tokenizer’s chat template.

    tok = AutoTokenizer.from_pretrained("google/txgemma-2b-predict")
    print(tok.chat_template)
    

    If you see a Jinja template producing <|im_start|>user, <|im_start|>assistant, <|im_end|>, that's exactly where those tokens come from (Gemma-2 IT models, for comparison, likewise ship their own turn-based chat templates). (Hugging Face)

  2. Inspect the processed training dataset from SFTTrainer.

    After constructing the trainer, look at trainer.train_dataset[0] or the tokenized batch. You should see that the "text" or token sequence already contains <|im_start|> / <|im_end|>. This confirms the trainer, not your raw dataset, injected them. (GitHub)
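    For example (field names depend on the TRL version; processed examples may expose a "text" string or already-tokenized "input_ids"):

    ex = trainer.train_dataset[0]
    # Decode tokenized examples so the injected markers become visible as text.
    print(tok.decode(ex["input_ids"]) if "input_ids" in ex else ex)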

  3. Compare decoding options.

    Generate once and examine:

    out = model.generate(**inputs, max_new_tokens=64)
    raw = tok.decode(out[0], skip_special_tokens=False)
    clean = tok.decode(out[0], skip_special_tokens=True)
    print("RAW:", raw)
    print("CLEAN:", clean)
    

    If RAW shows :<|im_end|>\n<|im_start|>assistant but CLEAN does not, then the tags are indeed just special tokens that the default decoding is exposing.


Compressed summary

  • TRL’s SFTTrainer treats {prompt, completion} as instruction data and, when a tokenizer has a chat_template, auto-applies that template using apply_chat_template. (Hugging Face)

  • In your setup, the template that ended up being applied is ChatML-like and introduces <|im_start|> / <|im_end|> and role tags.

  • Your TxGemma-2b-predict checkpoint was therefore fine-tuned in a chat-style format, even though your raw dataset was plain completion.

  • At inference you are feeding bare prompts and decoding raw text, so the model generates the missing chat structure and you see :<|im_end|>\n<|im_start|>assistant.

  • To resolve it:

    • Either treat the model as chat-style: use apply_chat_template at inference and strip at <|im_end|> or use skip_special_tokens=True.
    • Or retrain from the base predict model with the chat template disabled or with a fully manual completion-only formatting pipeline.