I don’t have any data science knowledge whatsoever, but I think we can manage if we just do some basic preprocessing in Python… Functions for data processing and shaping are usually available somewhere in the libraries.
Use one rendered text column for SFT. Do not map instruction/input/output separately. Convert your rows to the model’s chat format, save as a single-column dataset, and map text → text in AutoTrain. (Hugging Face)
Beginner guide: LLM SFT with AutoTrain
1) Choose trainer and model
- Trainer: SFT in AutoTrain Advanced. (Hugging Face)
- Model: pick your chat model and its tokenizer, e.g. meta-llama/Llama-3.1-8B-Instruct; a quick access check follows below. (Hugging Face)
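meta-llama/Llama-3.1-8B-Instruct is a gated repository, so accept its license on the Hub and authenticate first (for example with huggingface-cli login). A minimal access check, which also confirms the tokenizer ships a chat template:

from transformers import AutoTokenizer

# Downloads only the tokenizer files; fails early if the repo is gated and you
# have not accepted the license or logged in.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
assert tok.chat_template is not None, "pick an Instruct/chat variant that ships a chat template"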
2) Know the accepted dataset shapes
SFTTrainer accepts either:
- single-column: {"text": "...final rendered conversation..."}, or
- two-column: {"prompt": "...", "completion": "..."}.
AutoTrain commonly uses the single text column for chat SFT. (Hugging Face)
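For illustration, one row in each accepted shape, written as Python dicts with placeholder strings:

# single-column: the whole conversation already rendered into one string
{"text": "<user turn + assistant turn, rendered in the model's chat format>"}

# two-column: raw prompt and completion, rendered by the trainer
{"prompt": "Summarize the paragraph below ...", "completion": "The paragraph argues ..."}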
3) Render your triples into one training string
- Build messages: user = instruction + ("\n\n" + input if present); assistant = output.
- Render with the tokenizer's chat template: apply_chat_template(messages, tokenize=False, add_generation_prompt=False).
- Save one column named text. (Hugging Face)
4) Minimal preprocessing code
from datasets import load_dataset
from transformers import AutoTokenizer
import pandas as pd

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def render_row(r):
    # Fold the optional input field into the user turn, then let the tokenizer's
    # chat template render the exact tokens and headers the model was trained on.
    user = r["instruction"] + (("\n\n" + r["input"]) if r.get("input") else "")
    messages = [{"role": "user", "content": user},
                {"role": "assistant", "content": r["output"]}]
    return tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)

ds = load_dataset("tatsu-lab/alpaca", split="train")  # replace with your data
df = pd.DataFrame({"text": [render_row(x) for x in ds]})
# Name the file train.csv: the CLI below points --data-path at this folder and
# loads the "train" split from it.
df.to_csv("train.csv", index=False)
apply_chat_template ensures the exact prompt tokens and headers the model expects. (Hugging Face)
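For Llama 3.1 a rendered row looks roughly like the sketch below; the exact header tokens, and any default system block the template prepends, come from the tokenizer, so print one row from your own data rather than trusting this example:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Give three tips for staying healthy.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

1. Eat a balanced diet ...<|eot_id|>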
5) Create the AutoTrain job
UI: upload CSV/JSONL, set Column Mapping → text → text, choose LLM finetuning → SFT. (Hugging Face)
CLI (reliable, explicit):
pip install autotrain-advanced
autotrain llm \
--train \
--project-name llama31-alpaca-sft \
--model meta-llama/Llama-3.1-8B-Instruct \
--data-path ./ \
--train-split train \
--text-column text \
--trainer sft \
--use-peft \
--lora-r 16 --lora-alpha 32 --lora-dropout 0.05 \
--batch-size 4 --gradient-accumulation 8 \
--lr 2e-4 --epochs 3 --bf16 \
--max-seq-length 4096
Flags mirror documented AutoTrain usage. Adjust batch size and gradient accumulation for your VRAM. (Hugging Face)
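When trading batch size against VRAM, keep the effective batch size (per-device batch × gradient accumulation × number of GPUs) roughly constant; with the flags above on a single GPU:

# effective examples per optimizer step with the flags above (single GPU assumed)
per_device_batch = 4
grad_accum = 8
effective_batch = per_device_batch * grad_accum  # 32
# e.g. --batch-size 2 with --gradient-accumulation 16 keeps the same effective batch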
6) Inference must match training
At generation, build messages and call the same tokenizer’s chat template to format the prompt before generate. Template mismatches degrade outputs. Llama 3.1 has known header nuances; verify your output. (Hugging Face)
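A minimal generation sketch assuming a LoRA run like the CLI above; the adapter path is a guess at the AutoTrain project output directory, so point it at wherever your run actually saved the adapter:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, "llama31-alpaca-sft")  # your adapter directory

messages = [{"role": "user", "content": "Give three tips for staying healthy."}]
# Same chat template as training, but with the generation prompt appended this time
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))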
7) When you’d use more columns
Only if you pick a different trainer or format:
- Prompt+completion SFT: map prompt and completion. (Hugging Face)
- DPO/ORPO: needs prompt, chosen, rejected; AutoTrain exposes those roles in column mapping (sample rows below). (Hugging Face)
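For comparison, one illustrative row per format (placeholder strings):

# prompt + completion SFT
{"prompt": "Translate to French: Good morning.", "completion": "Bonjour."}

# DPO / ORPO preference data
{"prompt": "Translate to French: Good morning.",
 "chosen": "Bonjour.",
 "rejected": "Guten Morgen."}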
8) Quick checks
- Open one CSV row. Confirm it contains the full rendered conversation string. (Hugging Face)
- If UI mapping is unclear, switch to CLI and set --text-column text. (Hugging Face)
- If outputs look odd, print a rendered example and confirm the chat headers match the model card's template; a small check script follows below. (Llama)
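A small script covering the first and last checks; it assumes the train.csv written in step 4 and Llama 3.1-style headers (other models use different markers):

import pandas as pd

df = pd.read_csv("train.csv")
example = df["text"].iloc[0]
print(example)  # should be the full rendered conversation, not a bare instruction
# crude header check for Llama 3.1-style templates; adjust for other models
assert "<|start_header_id|>assistant<|end_header_id|>" in example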
References
AutoTrain LLM finetuning and column mapping, TRL SFT dataset formats, and chat templating docs. (Hugging Face)