Llama 3.2 Prompting with pipeline - system message

I am applying a transformers pipeline to prompt Llama 3.2 with several dozen prompts.

My question is: is it correct to zip a system prompt in alongside each prompt passed to the pipeline, or should the system prompt simply be the first message in the pipeline?

Which approach would perform better based on your experience?

Here’s my code.

results = []
for cur in alert_messages:
    messages = [system_message, cur]
    outputs = pipe(messages)
    results.append(outputs[0]["generated_text"])

Regarding the standard Pipeline class in Transformers: since it has no memory, you must pass the system message every time.


Use a system message as the first message in every chat you send to the pipeline. Pipelines are stateless, so they won’t remember a prior system prompt. Chat templates also assume the system message sits at the front. (Hugging Face)
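Concretely, each chat you send to the pipeline is a list of role/content dicts with the system message first. A minimal sketch (the content strings below are placeholders, not from the original post):

```python
# Each call to the pipeline carries its own copy of the system message up front,
# because the pipeline keeps no state between calls.
system_message = {"role": "system", "content": "You are a helpful assistant."}  # placeholder text
cur = {"role": "user", "content": "Summarize this alert: ..."}  # placeholder text

messages = [system_message, cur]  # system first, then the user turn
```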

What to do

  • Keep: messages = [system_message, cur] for each prompt.
  • Prefer batching to cut overhead.
  • If you need more speed, use a server that prefix-caches identical system prompts.

Minimal fixes

Batched calls in Transformers

# Docs: https://huggingface.co/docs/transformers/en/main_classes/pipelines
# Chat templates: https://huggingface.co/docs/transformers/en/chat_templating
chats = [[system_message, cur] for cur in alert_messages]
outs = pipe(chats, max_new_tokens=256)  # batch many chats at once
results = [o[0]["generated_text"] for o in outs]  # one result list per chat

(Hugging Face)

If you ever build prompts manually, still put system first

# Docs: https://huggingface.co/docs/transformers/en/chat_templating
inputs = tokenizer.apply_chat_template(
    [
      {"role": "system", "content": SYSTEM_TXT},
      {"role": "user", "content": USER_TXT},
    ],
    add_generation_prompt=True,
    return_tensors="pt",
)

(Hugging Face)

Performance realities

  • Sending the system prompt per chat is required. There is no “once globally” in a plain pipeline. (Hugging Face)

  • For throughput, run the same chats on a backend with prefix caching (for example, vLLM) so the shared system prefix is computed once and reused.
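A minimal sketch of the prefix-caching route, using vLLM as an example backend. The model name, system-message text, and helper names below are assumptions for illustration; verify the vLLM API against the version you install:

```python
# Sketch: run the same chats through vLLM with prefix caching enabled,
# so the identical system message is computed once and reused across requests.
# Assumes: pip install vllm, and access to a Llama 3.2 instruct checkpoint.

system_message = {"role": "system", "content": "You are a helpful assistant."}  # placeholder text

def build_chats(alert_messages):
    """Pair the shared system message with each alert (pure Python, no GPU needed)."""
    return [[system_message, {"role": "user", "content": alert}] for alert in alert_messages]

def run_with_prefix_caching(alert_messages, model="meta-llama/Llama-3.2-3B-Instruct"):
    # Imported lazily so build_chats stays usable without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model, enable_prefix_caching=True)  # reuse the shared system prefix
    outs = llm.chat(build_chats(alert_messages), SamplingParams(max_tokens=256))
    return [o.outputs[0].text for o in outs]
```

Because every chat starts with the same system message, the cached prefix is hit on every request after the first, which is where the throughput win comes from.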

Restated

  • Correct: include the system prompt as the first message in each call.
  • Better performance: batch in Transformers; or use a runtime that reuses the identical system prefix across requests. (Hugging Face)

Short references