VLA models (Post-Training Isaac GR00T N1.5)

Hello everyone,

With post-training for Isaac GR00T N1.5, is it possible to train on a custom robot with a custom real-world dataset?

Thanks in advance


Seems possible?


You can post-train Isaac GR00T N1.5 on a custom robot using your own real-world dataset. NVIDIA’s public model card states N1.5 is adaptable via post-training; the Hugging Face tutorial shows a complete run on a new embodiment; LeRobot’s docs explain the dataset format, processors, and the GR00T policy integration. (Hugging Face)

What this actually means

  • Custom robot = new embodiment. You describe your robot’s observations and actions, then fine-tune with embodiment_tag="new_embodiment" so GR00T learns that interface. This is the documented path when your hardware wasn’t in pretraining. (GitHub)
  • Custom dataset = LeRobot format. Record or convert your demos to LeRobotDataset v3 (Parquet + MP4, plus meta/ JSON). You can stream from the Hub or load locally. (Hugging Face)
  • Policy I/O. Inputs are camera frames + proprio + a text instruction; outputs are continuous-valued action vectors you scale and send to your controller. (Hugging Face)
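
To make the I/O concrete, here is an illustrative sketch of one observation and one predicted action chunk. The key names, image size, and dimensions are placeholders (your own keys and dims come from your dataset and modality.json), not GR00T's fixed schema.

# Illustrative I/O sketch; key names, image size, and dims are placeholders.
import numpy as np

observation = {
    "observation.images.front": np.zeros((480, 640, 3), dtype=np.uint8),  # camera frame
    "observation.state": np.zeros(6, dtype=np.float32),                   # proprio, e.g. 5 joints + gripper
    "task": "pick up the red block and place it in the bin",              # language instruction
}

# The policy returns a short chunk of future actions in a normalized range;
# you rescale each step to your controller's units and limits before sending it.
action_chunk = np.zeros((16, 6), dtype=np.float32)  # (horizon, action_dim)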

Background, fast

  • GR00T N1.5 is a vision-language-action policy: VLM encoders plus a flow-matching transformer that predicts action chunks conditioned on vision, language, and state (conceptual sketch after this list). It is designed for cross-embodiment adaptation via post-training; the public 3B checkpoint is licensed for non-commercial use. (Hugging Face)
  • LeRobot is the training/runtime scaffold: unified dataset API, processors that map robots ↔ datasets, and a maintained GR00T N1.5 policy integration. (Hugging Face)
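
As rough intuition for the flow-matching part (a conceptual sketch only, not GR00T's actual code): the model starts an action chunk from noise and integrates a learned velocity field, conditioned on the vision/language/state features, toward the final actions.

# Conceptual flow-matching sketch (not GR00T's implementation).
import numpy as np

def generate_action_chunk(velocity_model, context, horizon=16, action_dim=6, steps=10):
    """Integrate a learned velocity field from noise toward an action chunk."""
    x = np.random.randn(horizon, action_dim)      # start from Gaussian noise
    for i in range(steps):
        t = i / steps
        v = velocity_model(x, t, context)         # velocity conditioned on vision/language/state
        x = x + v / steps                         # one Euler integration step
    return x                                      # denoised action chunk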

End-to-end recipe (clear and explicit)

1) Choose a base and verify terms

  • Model: nvidia/GR00T-N1.5-3B. Confirm “ready for non-commercial use.” (Hugging Face)

2) Record or port your data

  • Use LeRobot v3 tools to record directly to Hub/local, or convert an existing set to v3. v3 uses Parquet for state/action and MP4 for video, with meta/ describing schema, FPS, and episode offsets. Supports StreamingLeRobotDataset to train without downloading. (Hugging Face)
  • Example starter datasets you can imitate for structure: SO-101 pick-place, SO-100 pick-place, DROID v1.0.1 (LeRobot ports). (Hugging Face)
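
Before recording your own data, it helps to load one of these starter datasets and print its schema as a template for how your recordings should look. A minimal sketch, assuming lerobot>=0.4.0 and the attribute names shown in the current LeRobot docs:

# Peek at a known-good v3 dataset to see the schema your recordings should mirror.
# Assumes `pip install "lerobot>=0.4.0"`; exact keys depend on the dataset you load.
from lerobot.datasets.lerobot_dataset import LeRobotDataset

ref = LeRobotDataset("lerobot/svla_so101_pickplace")   # downloads and caches from the Hub
print(ref.fps, ref.num_episodes)                       # recording rate, episode count
frame = ref[0]
for key, value in frame.items():
    shape = getattr(value, "shape", None)
    print(key, shape if shape is not None else type(value).__name__)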

3) Describe your embodiment

  • Add meta/modality.json to your dataset. Copy an example, then edit camera names, state keys, and action dims (a sketch follows this list). In the official tutorial this is Step 1.2. For new robots, set embodiment_tag to new_embodiment. (Hugging Face)
  • GitHub issues and docs confirm this tag is required when your embodiment wasn't part of pretraining. (GitHub)
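
A minimal modality.json sketch, patterned on the SO-100/SO-101 example from the tutorial; the group names, index ranges, camera key, and dataset path are placeholders you must replace with your robot's actual layout.

# Minimal modality.json sketch; names, index ranges, and paths are placeholders.
import json, pathlib

modality = {
    "state": {
        "single_arm": {"start": 0, "end": 5},   # slices into observation.state
        "gripper":    {"start": 5, "end": 6},
    },
    "action": {
        "single_arm": {"start": 0, "end": 5},   # slices into action
        "gripper":    {"start": 5, "end": 6},
    },
    "video": {
        "front": {"original_key": "observation.images.front"},  # one entry per camera
    },
    "annotation": {
        "human.task_description": {"original_key": "task_index"},
    },
}

meta_dir = pathlib.Path("/data/my_robot_v3_dataset/meta")
meta_dir.mkdir(parents=True, exist_ok=True)
(meta_dir / "modality.json").write_text(json.dumps(modality, indent=2))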

4) Fine-tune

  • The HF tutorial provides a runnable command (scripts/gr00t_finetune.py). It notes ~25 GB VRAM for defaults and shows flags to reduce memory if needed. (Hugging Face)
# Fine-tune GR00T N1.5 on your LeRobot v3 dataset
# refs:
#  blog: https://huggingface.co/blog/nvidia/gr00t-n1-5-so101-tuning
#  repo: https://github.com/NVIDIA/Isaac-GR00T
python scripts/gr00t_finetune.py \
  --dataset-path /data/my_robot_v3_dataset \
  --num-gpus 1 \
  --output-dir ./checkpoints/my_robot_n1p5 \
  --max-steps 10000 \
  --data-config so100_dualcam \
  --video-backend torchvision_av

5) Evaluate and deploy

  • Use the tutorial’s open-loop eval and inference server + client scripts. Map the model’s action vector to your controller API (ROS2 or vendor SDK). Keep units and bounds consistent. (Hugging Face)
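
A sketch of the controller-side mapping under explicit assumptions: send_joint_positions is a placeholder for your controller call (ROS2 publisher or vendor SDK), and the joint limits are made-up values you must replace with your robot's.

# Deployment sketch. `send_joint_positions` is a placeholder for your controller API
# (ROS2 publisher, vendor SDK call); the joint limits below are made-up values.
import numpy as np

JOINT_LOW  = np.array([-1.57, -1.57, -1.57, -1.57, -1.57, 0.0])  # replace with your robot's limits
JOINT_HIGH = np.array([ 1.57,  1.57,  1.57,  1.57,  1.57, 1.0])

def execute_chunk(action_chunk, send_joint_positions):
    """Clip each predicted step to the robot's limits, then hand it to the controller."""
    for step in np.asarray(action_chunk):         # shape (horizon, action_dim)
        target = np.clip(step, JOINT_LOW, JOINT_HIGH)
        send_joint_positions(target)              # called at your control rate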

Data and format details you must get right

  • LeRobot v3 layout: meta/info.json (schema, fps), meta/stats.json (norm stats), meta/episodes/ (episode offsets), data/ Parquet shards, videos/ per-camera MP4 shards. Episode views are reconstructed from metadata. (Hugging Face)
  • Loading/streaming: LeRobotDataset(...) for local, cached data; StreamingLeRobotDataset(...) for on-the-fly streaming during training. Both return dicts with keys like observation.images.front, observation.state, and action. (Hugging Face)
  • Processors: LeRobot processors define the glue between your hardware and dataset keys; start from the official “Processors for Robots and Teleoperators.” (Hugging Face)

Known pitfalls and fixes

  • Parquet/MP4 vs expected layout: Some users hit loader errors if the dataset layout doesn’t match the pipeline’s expectations. Align your modality.json keys and verify the v3 loader version. (GitHub)
  • Large action spaces: Reports of overflow/instability at the start of training when action_dim is large; restarts or stabilization recipes help. Monitor loss/grad norms. (GitHub)
  • Camera name mismatches: If your data uses tip but the example uses wrist, update modality.json and processors accordingly. Community guides show concrete edits. (Zenn)
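
A quick sanity check for the camera-name pitfall, assuming your dataset lives at the path shown and that meta/info.json lists the dataset's features as described in the v3 layout above: compare the camera keys declared in modality.json against the declared features.

# Check that camera keys declared in modality.json are also declared in meta/info.json.
# The dataset path is illustrative; adjust to your local copy.
import json, pathlib

root = pathlib.Path("/data/my_robot_v3_dataset/meta")
modality = json.loads((root / "modality.json").read_text())
features = json.loads((root / "info.json").read_text()).get("features", {})

for cam, spec in modality.get("video", {}).items():
    key = spec["original_key"]                    # e.g. observation.images.front
    print(f"{cam}: {key} -> {'ok' if key in features else 'MISSING from info.json'}")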

Evidence that this path works

  • Official tutorial post-trains N1.5 on SO-101 from teleop demos, including dataset prep, finetune, eval, deploy. (Hugging Face)
  • Public fine-tunes: Dozens of community N1.5 checkpoints on Hugging Face confirm the workflow is repeatable on varied tasks and rigs. (Hugging Face)

Minimal working example (data load → fine-tune)

# pip install "lerobot>=0.4.0"  # docs: https://huggingface.co/docs/lerobot
# refs:
#   dataset: https://huggingface.co/datasets/lerobot/svla_so101_pickplace
#   tutorial: https://huggingface.co/blog/nvidia/gr00t-n1-5-so101-tuning
from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset

repo_id = "lerobot/svla_so101_pickplace"  # small, clean, GR00T-ready example
ds = StreamingLeRobotDataset(repo_id)     # streams frames from the Hub, no full download
sample = next(iter(ds))                   # streaming datasets are iterated, not indexed
print(sample.keys())                      # expect observation.*, action, plus metadata keys

Run the tutorial’s gr00t_finetune.py with your dataset path after you validate keys and shapes. (Hugging Face)


Starter picks on Hugging Face

Models

  • nvidia/GR00T-N1.5-3B — base policy for post-training; model card explicitly mentions post-training support and shows I/O. (Hugging Face)
  • Community finetunes (reference configs, tasks, dual-cam) — browse the gr00t_n1_5 filter on the Hub. (Hugging Face)

Datasets

  • lerobot/svla_so101_pickplace and lerobot/svla_so100_pickplace — small, proven with the official tutorial and LeRobot loaders. Good for smoke tests. (Hugging Face)
  • lerobot/droid_1.0.1 — large in-the-wild manipulation demos in LeRobot format. Useful for diversity or pretraining. (Hugging Face)

Docs you will use repeatedly

  • LeRobot Dataset v3 design, directory layout, streaming API, and migration notes. (Hugging Face)
  • GR00T N1.5 policy integration page in LeRobot. (Hugging Face)

Synthetic-data booster (optional)

If you are data-limited, NVIDIA’s GR00T-Dreams blueprint shows a pipeline to generate large synthetic trajectory sets and mix them with real demos for post-training. (NVIDIA Developer)


Quick checklist

  • Base checkpoint chosen and license verified. (Hugging Face)
  • Real demos recorded or ported to LeRobot v3. (Hugging Face)
  • meta/modality.json matches your sensors, action dims, and camera names; embodiment_tag="new_embodiment". (Hugging Face)
  • Finetune with the official script; monitor VRAM and training stability. (Hugging Face)
  • Evaluate and deploy via the server/client example; map action vectors to your controller. (Hugging Face)

Short, curated references (grouped)

Official + model cards

  • GR00T N1.5 model card: post-training support, I/O, license. (Hugging Face)
  • GR00T GitHub repo (scripts, examples, “deeper understanding”). (GitHub)

Step-by-step guides

  • HF tutorial: Post-Training N1.5 on SO-101 (dataset prep, modality.json, training, eval, deploy, VRAM). (Hugging Face)
  • LeRobot: GR00T N1.5 Policy integration. (Hugging Face)
  • LeRobot: Dataset v3 docs (record, stream, migrate, directory layout). (Hugging Face)

Datasets

  • lerobot/svla_so101_pickplace, lerobot/svla_so100_pickplace, lerobot/droid_1.0.1 — ready to load. (Hugging Face)

Troubleshooting and pitfalls

  • New embodiment and tag usage clarifications. (GitHub)
  • Layout/loader mismatch with Parquet/MP4 vs expectations. (GitHub)
  • Stability issues for large action dims. (GitHub)

Thank you very much for your detailed explanation :slight_smile:
