Ego2Robot – Convert egocentric video to robot training data


I built Ego2Robot over the past two weeks.

What it is: An open-source pipeline that converts egocentric human video (such as factory work or warehouse operations) into robot-compatible training datasets. Think: 10,000 hours of existing video → robot foundation model pretraining data.

The problem I’m solving: Robot foundation models (like Physical Intelligence’s π₀) need diverse training data, but collecting robot demonstrations costs $100-500/hour. Meanwhile, there are thousands of hours of human work already captured on video (Egocentric-10K has 10,000 hours from 85 factories). The gap is tooling to convert it into usable formats.

What’s different: Most robotics datasets are manually collected and annotated. This pipeline is fully automated:

  • Quality filtering (motion + hand detection) cuts a 433 s video down to 60 s of useful manipulation footage (see the sketch right after this list)
  • Semantic extraction using VideoMAE + CLIP (zero-shot, no fine-tuning; sketched further below)
  • Unsupervised skill discovery (found 10 distinct manipulation patterns)
  • Exports to LeRobot v3 format (the standard for robot learning datasets)
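
To make the filtering step concrete, here is a minimal sketch of how motion plus hand detection can gate frames, using OpenCV frame differencing and MediaPipe Hands. The thresholds, the 1-frame-per-second sampling, and the function name are my own illustrative assumptions, not the pipeline's actual values:

```python
# Minimal sketch of a motion + hand-presence filter (NOT the project's actual code).
# MOTION_THRESH and MIN_HAND_CONF are assumed values for illustration.
import cv2
import mediapipe as mp
import numpy as np

MOTION_THRESH = 12.0   # mean absolute pixel difference between sampled frames (assumed)
MIN_HAND_CONF = 0.5    # MediaPipe hand-detection confidence (assumed)

hands = mp.solutions.hands.Hands(static_image_mode=True,
                                 max_num_hands=2,
                                 min_detection_confidence=MIN_HAND_CONF)

def useful_timestamps(video_path, fps_sample=1):
    """Yield timestamps (seconds) where the frame shows motion AND a visible hand."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    step = int(fps // fps_sample) or 1
    prev_gray, idx = None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # Cheap motion check: mean absolute difference against the previous sample
            moving = prev_gray is not None and \
                np.abs(gray.astype(np.int16) - prev_gray.astype(np.int16)).mean() > MOTION_THRESH
            prev_gray = gray
            # MediaPipe expects RGB input
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            has_hand = hands.process(rgb).multi_hand_landmarks is not None
            if moving and has_hand:
                yield idx / fps
        idx += 1
    cap.release()
```

Frames that pass both checks can then be grouped into contiguous segments; everything else is dropped before the heavier VideoMAE/CLIP models ever see it.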

Technical stack: PyTorch, Transformers (VideoMAE, CLIP), MediaPipe, OpenCV, scikit-learn. Everything runs on pretrained models; nothing is trained from scratch.
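
For the semantic-extraction and skill-discovery steps, here is a hedged sketch: zero-shot CLIP labeling of frames followed by k-means clustering of the image embeddings. The candidate label list and the choice of 10 clusters mirror the "10 distinct manipulation patterns" above, but this is an illustration under those assumptions, not the project's implementation:

```python
# Sketch of zero-shot CLIP labeling + unsupervised skill discovery via k-means.
# The label list and n_skills=10 are assumptions for illustration.
import torch
from transformers import CLIPModel, CLIPProcessor
from sklearn.cluster import KMeans

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical candidate labels for industrial manipulation footage
labels = ["picking up a part", "placing a part", "using a hand tool",
          "operating a machine", "inspecting an object", "idle hands"]

def label_and_embed(frames):
    """frames: list of PIL.Image. Returns zero-shot labels and CLIP image embeddings."""
    inputs = processor(text=labels, images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    probs = out.logits_per_image.softmax(dim=-1)   # (n_frames, n_labels)
    picked = [labels[int(i)] for i in probs.argmax(dim=-1)]
    return picked, out.image_embeds                # embeddings: (n_frames, 512)

def discover_skills(embeds, n_skills=10):
    """Cluster frame embeddings into candidate 'skills' with no supervision."""
    km = KMeans(n_clusters=n_skills, n_init=10, random_state=0)
    return km.fit_predict(embeds.cpu().numpy())
```

In practice you would likely pool embeddings per segment (and fold in VideoMAE features) before clustering, but frame-level clustering is enough to show the shape of the approach.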

You can try it here:

What I learned: Foundation models (VideoMAE, CLIP) transfer surprisingly well to industrial footage despite being trained on general video/images. Automated quality filtering is critical - you can’t scale if you’re manually reviewing every frame.

Happy to answer questions about the technical approach, design decisions, or where this goes next!
