AutoTrain with an Alpaca-format dataset

Hi there,

I’m new both to this forum and to the Hugging Face world, so please go easy on me :slight_smile:
I have a question. I want to use AutoTrain to fine-tune a model like meta-llama/Llama-3.1-8B-Instruct. I have a dataset in the Alpaca format, with instruction, input and output columns.

My questions are:

I couldn’t find good documentation or an example that shows how to fine-tune a model with this type of dataset.

None of the information buttons on the AutoTrain screen work, such as the ones above the task or parameter combo boxes.

How can I add more fields in the column mapping section? There is only one right now, and I think I need to map the instruction, input and output columns.

If there is any good documentation, please share it with me so I can start learning.

Best regards,
Yunus Emre

1 Like

Hmm… Try this. And for AutoTrain CSV data format.

Hi @John6666 ,

Thank you for your response. I’ve tried a few things based on the links you shared. I believe it is better now, but I still have some questions. If you could point me in the right direction, it would be really helpful.

For the LLM SFT task, I need to combine the columns from the dataset and put them into one text column in the CSV. The point I don’t understand is: how will the LLM know which column means what? I saw a few other datasets here; for example, one of them has 3 columns but another has 7. Is there any way to tell which dataset format should be used in which case, or does this require data science knowledge?

Best regards,
Yunus

1 Like

I don’t have any data science knowledge whatsoever, but I think we can manage if we just do some basic preprocessing in Python… Functions for data processing and shaping are usually available somewhere in the libraries.


Use one rendered text column for SFT. Do not map instruction/input/output separately. Convert your rows to the model’s chat format, save as a single-column dataset, and map text → text in AutoTrain. (Hugging Face)

Beginner guide: LLM SFT with AutoTrain

1) Choose trainer and model

  • Trainer: SFT in AutoTrain Advanced. (Hugging Face)
  • Model: pick your chat model and its tokenizer, e.g. meta-llama/Llama-3.1-8B-Instruct. (Hugging Face)

2) Know the accepted dataset shapes

SFTTrainer accepts either:

  • single-column: {"text": "...final rendered conversation..."}, or
  • two-column: {"prompt": "...", "completion": "..."}.
    AutoTrain commonly uses the single text column for chat SFT. (Hugging Face)
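
For illustration, one row in each shape could look roughly like this (the values below are made up, not from any real dataset):

# Illustrative rows only; values are hypothetical.
single_column_row = {
    "text": "<the full conversation, already rendered with the chat template>"
}

prompt_completion_row = {
    "prompt": "Summarize the following paragraph:\n\n<paragraph>",
    "completion": "<the reference summary>"
}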

3) Render your triples into one training string

  • Build messages: user = instruction + ("\n\n" + input if present); assistant = output.
  • Render with the tokenizer’s chat template: apply_chat_template(messages, tokenize=False, add_generation_prompt=False).
  • Save one column named text. (Hugging Face)

4) Minimal preprocessing code

from datasets import load_dataset
from transformers import AutoTokenizer
import pandas as pd

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def render_row(r):
    # Fold the optional "input" field into the user turn.
    user = r["instruction"] + (("\n\n" + r["input"]) if r.get("input") else "")
    messages = [{"role": "user", "content": user},
                {"role": "assistant", "content": r["output"]}]
    # Render with the model's own chat template, kept as plain text for the CSV.
    return tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)

ds = load_dataset("tatsu-lab/alpaca", split="train")  # replace with your data
df = pd.DataFrame({"text": [render_row(x) for x in ds]})
df.to_csv("autotrain_llm_sft.csv", index=False)

apply_chat_template ensures the exact prompt tokens and headers the model expects. (Hugging Face)
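
As a quick sanity check, you can render one hypothetical row and inspect the output (this reuses tok and render_row from the snippet above; the row values are made up):

sample = {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
print(render_row(sample))
# For Llama 3.1 you should see header markers such as
# <|start_header_id|>user<|end_header_id|> ... <|eot_id|> around each turn;
# the exact markup comes from the tokenizer's chat template.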

5) Create the AutoTrain job

UI: upload CSV/JSONL, set Column Mapping → text → text, choose LLM finetuning → SFT. (Hugging Face)
CLI (reliable, explicit):

pip install autotrain-advanced

autotrain llm \
  --train \
  --project-name llama31-alpaca-sft \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --data-path ./ \
  --train-split train \
  --text-column text \
  --trainer sft \
  --use-peft \
  --lora-r 16 --lora-alpha 32 --lora-dropout 0.05 \
  --batch-size 4 --gradient-accumulation 8 \
  --lr 2e-4 --epochs 3 --bf16 \
  --max-seq-length 4096

Flags mirror documented AutoTrain usage. Adjust batch size and gradient accumulation to fit your VRAM. (Hugging Face)

6) Inference must match training

At generation, build messages and call the same tokenizer’s chat template to format the prompt before generate. Template mismatches degrade outputs. Llama 3.1 has known header nuances; verify your output. (Hugging Face)
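
A minimal generation sketch along those lines, assuming a standard transformers setup (the checkpoint path and generation settings are illustrative):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "llama31-alpaca-sft"  # your fine-tuned checkpoint (merge the LoRA adapter first if needed)
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Build messages exactly as during training and let the chat template format the prompt.
messages = [{"role": "user", "content": "Summarize: AutoTrain fine-tunes LLMs from a CSV file."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

out = model.generate(inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))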

7) When you’d use more columns

Only if you pick a different trainer or format:

  • Prompt+completion SFT: map prompt and completion. (Hugging Face)
  • DPO/ORPO: needs prompt, chosen, rejected. AutoTrain exposes those roles in column mapping. (Hugging Face)
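
For comparison, a preference row for DPO/ORPO would look roughly like this (hypothetical values):

dpo_row = {
    "prompt": "Explain overfitting in one sentence.",
    "chosen": "Overfitting is when a model memorizes the training data and generalizes poorly to new data.",
    "rejected": "Overfitting is when a model trains too fast."
}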

8) Quick checks

  • Open one CSV row. Confirm it contains the full rendered conversation string. (Hugging Face)
  • If UI mapping is unclear, switch to CLI and set --text-column text. (Hugging Face)
  • If outputs look odd, print a rendered example, confirm chat headers match the model card’s template. (Llama)

References

AutoTrain LLM finetuning and column mapping, TRL SFT dataset formats, and chat templating docs. (Hugging Face)

For SFT and its practical implementation, the Smol course provides a concise overview of the entire process, so I recommend giving it a quick read.

Hi @John6666 ,

Great explanation, and these are wonderful links. I feel enlightened. I’ve even started following that Smol course.

Thank you,
Yunus :hugs:

1 Like

Welcome! :blush: You’re on the right track. Hugging Face AutoTrain does support fine-tuning instruction-style datasets like Alpaca, but it’s a bit limited compared to manual training.

  • For datasets with instruction / input / output, the standard approach is to merge instruction + input into a single prompt column, and keep output as the label. AutoTrain usually expects just one “text” and one “label/output” field.

  • If the UI only shows one mapping field, you’ll need to preprocess your dataset before uploading (e.g., combine instruction + input into a new prompt column).

  • For full control, many people skip AutoTrain and instead use the Hugging Face trl library (SFTTrainer) with LoRA. This gives you more flexibility for instruction-tuning LLaMA models; see the sketch right after this list.
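
A minimal sketch of that TRL route, assuming a recent trl/peft install and the single-column CSV produced earlier in the thread (hyperparameters are illustrative, not a recommendation):

from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# One "text" column containing the rendered conversations.
train_ds = load_dataset("csv", data_files="autotrain_llm_sft.csv", split="train")

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

args = SFTConfig(
    output_dir="llama31-alpaca-sft-trl",
    dataset_text_field="text",          # column holding the training strings
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    num_train_epochs=3,
    bf16=True,
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # trl can load the model from its hub id
    train_dataset=train_ds,
    args=args,
    peft_config=peft_config,
)
trainer.train()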

Docs to check:

  • Fine-tuning with TRL

  • AutoTrain docs

So TL;DR: preprocess into 2 columns (prompt, output), then upload to AutoTrain, or use trl for more advanced setups.
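
If you go the two-column route, the preprocessing could be as simple as the following pandas sketch (the column names "prompt" and "output" follow the suggestion above; check what your AutoTrain version expects in column mapping):

import pandas as pd
from datasets import load_dataset

ds = load_dataset("tatsu-lab/alpaca", split="train")  # replace with your data
df = ds.to_pandas()

# Merge instruction + optional input into a single prompt column; keep output as the label.
df["prompt"] = df["instruction"] + df["input"].apply(lambda x: "\n\n" + x if x else "")
df[["prompt", "output"]].to_csv("autotrain_prompt_output.csv", index=False)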

2 Likes

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.