I am applying a transformers pipeline to prompt Llama 3.2 with several dozen prompts.
My question is: is it correct to zip a system prompt in along with each individual prompt passed to the pipeline, or should the system prompt simply be the first message sent to the pipeline?
Which approach would perform better in your experience?
Regarding the standard Pipeline class in Transformers: since it has no memory, you must pass the system prompt every time. Include it as the first message in every chat you send to the pipeline. Pipelines are stateless, so they won't remember a prior system prompt, and chat templates assume the system message sits at the front (see the Hugging Face docs).
What to do
Keep: messages = [system_message, cur] for each prompt.
Prefer batching to cut overhead.
If you need more speed, use a server that prefix-caches identical system prompts.
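The steps above can be sketched as a small, self-contained example. Names like `alert_messages` are placeholders for your own prompt list, and the system prompt text is made up for illustration:

```python
# The same system message is prepended to every chat, since the pipeline
# is stateless and will not remember it between calls.
system_message = {
    "role": "system",
    "content": "You are a concise assistant that triages alert messages.",  # made-up example
}

# Stand-in for your own list of a few dozen prompts.
alert_messages = [
    "Disk usage on host-1 exceeded 90%.",
    "Failed SSH logins spiked on host-2.",
]

# One conversation per prompt; the system message must come first so the
# chat template renders it in the position the model expects.
chats = [
    [system_message, {"role": "user", "content": msg}]
    for msg in alert_messages
]

# Every chat starts with the system role.
assert all(chat[0]["role"] == "system" for chat in chats)
```

These `chats` can then be passed to the pipeline in one batched call, as in the snippet below.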
Minimal fixes
Batched calls in Transformers
# Docs: https://huggingface.co/docs/transformers/en/main_classes/pipelines
# Chat templates: https://huggingface.co/docs/transformers/en/chat_templating
# One chat per prompt, each with the same system message up front.
chats = [[system_message, cur] for cur in alert_messages]
outs = pipe(chats, max_new_tokens=256)  # batch many chats at once
# Each output is a one-element list of dicts; for chat input,
# "generated_text" holds the whole conversation, so take the
# last message's content to get the model's reply.
results = [o[0]["generated_text"][-1]["content"] for o in outs]