Llama 3.2 Prompting with pipeline - system message

I am applying a transformers pipeline to prompt Llama 3.2 with several dozen prompts.

My question is: is it correct to zip a system prompt in alongside each prompt passed to the pipeline, or should the system prompt simply be the first message in the pipeline?

Which approach would perform better based on your experience?

Here’s my code.

results = []
for cur in alert_messages:
    messages = [system_message, cur]
    outputs = pipe(messages)
    results.append(outputs[0]["generated_text"])

Regarding the standard Pipeline class in Transformers: since it has no memory, you must pass the system message every time.


Use a system message as the first message in every chat you send to the pipeline. Pipelines are stateless, so they won’t remember a prior system prompt. Chat templates also assume the system message sits at the front. (Hugging Face)
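Concretely, each chat you send to the pipeline is a list of role/content dicts with the system message first. A minimal sketch (the content strings below are placeholders, not from the original post):

```python
# Each call to the pipeline carries its own copy of the system message up front,
# because the pipeline keeps no state between calls.
system_message = {"role": "system", "content": "You are a helpful assistant."}  # placeholder text
cur = {"role": "user", "content": "Summarize this alert: ..."}  # placeholder text

messages = [system_message, cur]  # system first, then the user turn
```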

What to do

  • Keep: messages = [system_message, cur] for each prompt.
  • Prefer batching to cut overhead.
  • If you need more speed, use a server that prefix-caches identical system prompts.

Minimal fixes

Batched calls in Transformers

# Docs: https://huggingface.co/docs/transformers/en/main_classes/pipelines
# Chat templates: https://huggingface.co/docs/transformers/en/chat_templating
chats = [[system_message, cur] for cur in alert_messages]
outs = pipe(chats, max_new_tokens=256)  # batch many chats at once
results = [o[0]["generated_text"] for o in outs]  # one result list per chat

(Hugging Face)

If you ever build prompts manually, still put system first

# Docs: https://huggingface.co/docs/transformers/en/chat_templating
inputs = tokenizer.apply_chat_template(
    [
      {"role": "system", "content": SYSTEM_TXT},
      {"role": "user", "content": USER_TXT},
    ],
    add_generation_prompt=True,
    return_tensors="pt",
)

(Hugging Face)

Performance realities

  • Sending the system prompt per chat is required. There is no “once globally” in a plain pipeline. (Hugging Face)

  • For throughput, run the same chats on a backend with prefix caching (for example, vLLM) so the shared system prefix is computed once and reused.
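A minimal sketch of the prefix-caching route, using vLLM as an example backend. The model name, system-message text, and helper names below are assumptions for illustration; verify the vLLM API against the version you install:

```python
# Sketch: run the same chats through vLLM with prefix caching enabled,
# so the identical system message is computed once and reused across requests.
# Assumes: pip install vllm, and access to a Llama 3.2 instruct checkpoint.

system_message = {"role": "system", "content": "You are a helpful assistant."}  # placeholder text

def build_chats(alert_messages):
    """Pair the shared system message with each alert (pure Python, no GPU needed)."""
    return [[system_message, {"role": "user", "content": alert}] for alert in alert_messages]

def run_with_prefix_caching(alert_messages, model="meta-llama/Llama-3.2-3B-Instruct"):
    # Imported lazily so build_chats stays usable without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model, enable_prefix_caching=True)  # reuse the shared system prefix
    outs = llm.chat(build_chats(alert_messages), SamplingParams(max_tokens=256))
    return [o.outputs[0].text for o in outs]
```

Because every chat starts with the same system message, the cached prefix is hit on every request after the first, which is where the throughput win comes from.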

Restated

  • Correct: include the system prompt as the first message in each call.
  • Better performance: batch in Transformers; or use a runtime that reuses the identical system prefix across requests. (Hugging Face)

Short references