yqi19 committed · Commit bc93303 · verified · 1 Parent(s): dd3c413

add: README with color_object checkpoint loading and inference guide

Files changed (1)
  1. README.md +86 -248
README.md CHANGED
@@ -1,152 +1,85 @@
 
- # 🤖 X-VLA: Soft-Prompted Transformer as a Scalable Cross-Embodiment Vision-Language-Action Model

- | 📄 **Paper** | 🌐 **Project Page** | 🤗 **Hugging Face** |
- | :---: | :---: | :---: |
- | [Read the Full Research](https://arxiv.org/pdf/2510.10274) | [Explore the Demos](https://thu-air-dream.github.io/X-VLA/) | [Access Models & Datasets](https://huggingface.co/collections/2toINF/x-vla) |
-
-
- ## 🏆 Highlights & News
-
- ### 🎉 Exciting News: X-VLA Accepted to ICLR 2026
- We are thrilled to announce that **X-VLA has been accepted to ICLR 2026**.
-
- ### 🚀 Now Supported in LeRobot
- X-VLA is now natively integrated into the [LeRobot platform](https://huggingface.co/docs/lerobot/xvla).
- Give it a try! We sincerely appreciate the support and collaboration from the Hugging Face team.
-
- ### 🥇 Champion Winner at IROS 2025
- X-VLA won **1st Place (Champion)** at the **AgiBot World Challenge**, held at **IROS 2025**.
-
-
- ---
-
- ## 🧩 Overview
-
- Successful generalist **Vision–Language–Action (VLA)** models depend on scalable, cross-platform training across diverse robotic embodiments.
- To leverage the heterogeneity of large-scale robot datasets, **X-VLA** introduces a **soft prompt** mechanism — embodiment-specific learnable embeddings that guide a unified Transformer backbone toward effective multi-domain policy learning.
-
- The resulting architecture — **X-VLA-0.9B** achieves **state-of-the-art generalization** across six simulation platforms and three real-world robots, surpassing prior VLA approaches in dexterity, adaptability, and efficiency.
-
- https://github.com/user-attachments/assets/c047bac4-17c3-4d66-8036-badfab2b8c41

  ---

- ## 🚀 Quick Start: Installation & Deployment

- ### 1️⃣ Installation
-
- ```bash
- # Clone the repository
- git clone https://github.com/2toinf/X-VLA.git
- cd X-VLA
- ```

  ```bash
- # Create and activate Conda environment
- conda create -n XVLA python=3.10 -y
- conda activate XVLA
-
- # Install dependencies
  pip install -r requirements.txt
  ```

- or
-
- ```bash
- conda env create -f environment.yml
- conda activate xvla-stable
- ```
-
- ---
- ### 2️⃣ Deploying X-VLA for Inference
-
- X-VLA adopts a **Server–Client** architecture to separate the model environment from simulation or robot-specific dependencies.
- This design avoids package conflicts and supports distributed inference across GPUs, SLURM clusters, or edge devices.
-
- #### 🧠 Available Pre-trained Models
-
- - [ ] We observed a slight performance drop (around 1% across different datasets) after converting our models to the HF format, and we’re actively investigating the cause.
-
- #### 🧠 About Libero Setup and Evluation
-
- - [x] For questions about converting relative actions to absolute actions and our implementation, please first refer to issue [#2](https://github.com/2toinf/X-VLA/issues/2) and [#15](https://github.com/2toinf/X-VLA/issues/15). We have updated full preprocessing guidance [here](https://github.com/2toinf/X-VLA/blob/main/evaluation/libero/preprocess.md).
-
- #### 🔥 Update: We have released the LoRA fine-tuning code, along with checkpoints and the associated inference code.
-
- | Model ID | Embodiment | Description | Performance | Evaluation Guidance |
- | :--- | :--- | :--- | :---: | :---: |
- | [`2toINF/X-VLA-Pt`](https://huggingface.co/2toINF/X-VLA-Pt) | Foundation | Pretrained on large-scale heterogeneous robot–vision–language datasets for general transfer. | — | — |
- | [`2toINF/X-VLA-AgiWorld-Challenge`](https://huggingface.co/2toINF/X-VLA-AgiWorld-Challenge) | Agibot-G1 | Fine-tuned for AgiWorld Challenge. | **Champion🥇** | - |
- | [`2toINF/X-VLA-Calvin-ABC_D`](https://huggingface.co/2toINF/X-VLA-Calvin-ABC_D) | Franka | Fine-tuned on CALVIN benchmark (ABC_D subset) | **4.43** | [Calvin Eval](evaluation/calvin/README.md) |
- | [`2toINF/X-VLA-Google-Robot`](https://huggingface.co/2toINF/X-VLA-Google-Robot) | Google Robot | Fine-tuned on large-scale Google Robot dataset | **83.5%(VM) 76.4%(VA)** | [Simpler Eval](evaluation/simpler/README.md) |
- | [`2toINF/X-VLA-Libero`](https://huggingface.co/2toINF/X-VLA-Libero) | Franka | Fine-tuned on LIBERO benchmark | **98.1%** | [LIBERO Eval](evaluation/libero/README.md) |
- | [`2toINF/X-VLA-VLABench`](https://huggingface.co/2toINF/X-VLA-VLABench) | Franka | Fine-tuned on VLABench benchmark | **51.1(score)** | [VLABench Eval](evaluation/vlabench/README.md) |
- | [`2toINF/X-VLA-RoboTwin2`](https://huggingface.co/2toINF/X-VLA-RoboTwin2) | Agilex | Trained on RoboTwin2 dataset for dual-arm coordinated manipulation(50 demos for each task). | **70%** | [RoboTwin2.0 Eval](evaluation/robotwin-2.0/README.md) |
- | [`2toINF/X-VLA-WidowX`](https://huggingface.co/2toINF/X-VLA-WidowX) | WidowX | Fine-tuned on BridgeDataV2 (Simpler benchmark). | **95.8%** | [Simpler Eval](evaluation/simpler/README.md) |
- | [`2toINF/X-VLA-SoftFold`](https://huggingface.co/2toINF/X-VLA-SoftFold) | Agilex | Fine-tuned on Soft-Fold Dataset. Specialized in deformable object manipulation (e.g., folding and cloth control). | cloth folding with a 100% success rate in 2 hours. | [SoftFold-Agilex](evaluation/SoftFold-Agilex/readme.md) |
- | LoRA Adapters | || | |
- | [`2toINF/X-VLA-libero-spatial-peft`](https://huggingface.co/2toINF/X-VLA-libero-spatial-peft) | Franka | Fine-tuned on LIBERO benchmark | **96.2%** | [LIBERO Eval](evaluation/libero/README.md) |
- | [`2toINF/X-VLA-libero-object-peft`](https://huggingface.co/2toINF/X-VLA-libero-object-peft) | Franka | Fine-tuned on LIBERO benchmark | **96%** | [LIBERO Eval](evaluation/libero/README.md) |
- | [`2toINF/X-VLA-libero-goal-peft`](https://huggingface.co/2toINF/X-VLA-libero-goal-peft) | Franka | Fine-tuned on LIBERO benchmark | **94.4%** | [LIBERO Eval](evaluation/libero/README.md) |
- | [`2toINF/X-VLA-libero-long-peft`](https://huggingface.co/2toINF/X-VLA-libero-long-peft) | Franka | Fine-tuned on LIBERO benchmark | **83.2%** | [LIBERO Eval](evaluation/libero/README.md) |
- | [`2toINF/X-VLA-simpler-widowx-peft`](https://huggingface.co/2toINF/X-VLA-simpler-widowx-peft) | WidowX | Fine-tuned on BridgeDataV2 (Simpler benchmark). | **66.7%** | [Simpler Eval](evaluation/simpler/README.md) |
-
- ---

- ## 🧩 Notes

- - All models share a consistent architecture: `configuration_xvla.py`, `modeling_xvla.py`, and unified tokenizer (`tokenizer.json`).
- - The **X-VLA-Pt** model is the *foundation checkpoint*, trained across multiple robot domains.
- - Each embodiment is fine-tuned for its respective environment while retaining cross-embodiment alignment.
- - Evaluation scripts (in `evaluation/`) follow a standardized format for reproducible benchmarking.
-
- ---
-
- > 📊 Performance metrics follow standard evaluation protocols detailed in the [paper](https://arxiv.org/pdf/2510.10274).
-
- ---

- ### 3️⃣ Launching the Inference Server

  ```python
  from transformers import AutoModel, AutoProcessor
- import json_numpy

- # Load model and processor
- model = AutoModel.from_pretrained("2toINF/X-VLA-WidowX", trust_remote_code=True)
- processor = AutoProcessor.from_pretrained("2toINF/X-VLA-WidowX", trust_remote_code=True)

- # Start the inference server
- print("🚀 Starting X-VLA inference server...")
  model.run(processor, host="0.0.0.0", port=8000)
  ```

- Once launched, the API endpoint is available at:
-
  ```
- POST http://<server_ip>:8000/act
  ```

- ---
-
- ### 4️⃣ Client Interaction & Action Prediction
-
- The client communicates via HTTP POST, sending multimodal data (vision + language + proprioception) as a JSON payload.
-
- #### Payload Structure
-
- | Key | Type | Description |
- | :--- | :--- | :--- |
- | `proprio` | `json_numpy.dumps(array)` | Current proprioceptive state (e.g., joint positions). |
- | `language_instruction` | `str` | Task instruction (e.g., "Pick up the red block"). |
- | `image0` | `json_numpy.dumps(array)` | Primary camera image (RGB). |
- | `image1`, `image2` | *optional* | Additional camera views if applicable. |
- | `domain_id` | `int` | Identifier for the current robotic embodiment/domain. |
- | `steps` | `int` | denoising steps for flow-matching based generation (e.g., 10). |
-
- #### Example Client Code

  ```python
  import requests
@@ -154,119 +87,54 @@ import numpy as np
  import json_numpy

  server_url = "http://localhost:8000/act"
- timeout = 5

- # Prepare inputs
- proprio = np.zeros(7, dtype=np.float32)
- image = np.zeros((256, 256, 3), dtype=np.uint8)
- instruction = "Move the gripper to the target position"

  payload = {
      "proprio": json_numpy.dumps(proprio),
-     "language_instruction": instruction,
      "image0": json_numpy.dumps(image),
-     "domain_id": 0,
-     "steps": 10
  }

- try:
-     response = requests.post(server_url, json=payload, timeout=timeout)
-     response.raise_for_status()
-     result = response.json()
-     actions = np.array(result["action"], dtype=np.float32)
-     print(f"✅ Received {actions.shape[0]} predicted actions.")
- except Exception as e:
-     print(f"⚠️ Request failed: {e}")
-     actions = np.zeros((30, 20), dtype=np.float32)
  ```

- #### Expected Output

- ```
- [Server] Model loaded successfully on cuda:0
- [Server] Listening on 0.0.0.0:8000
- [Client] Sending observation to server...
- Received 30 predicted actions.
- ```
-
- ---

- ### 5️⃣ Standardized Control Interface: EE6D
-
- To ensure consistency across embodiments, **X-VLA** adopts a unified **EE6D (End-Effector 6D)** control space.
-
- | Component | Specification | Notes |
- | :--- | :--- | :--- |
- | **Proprio Input** | Current EE6D pose (position + orientation) | Must align with training-space normalization. |
- | **Action Output** | Predicted target delta/absolute pose (EE6D) | Executed by downstream controller. |
- | **Dimensionality** | 20-D vector = 3 (EE Pos) + 6 (Rotation in 6D) + 1 (Gripper) + 10 (Padding) | |
- | **Single-arm Case** | If only one arm exists, pad with zeros to maintain 20D vector. | |
-
- > ⚙️ **Reference Post-processing:**
- >
- > ```python
- > from datasets.utils import rotate6d_to_xyz
- > action_final = np.concatenate([
- >     action_pred[:3],
- >     rotate6d_to_xyz(action_pred[3:9]),
- >     np.array([1.0 if action_pred[9] > 0.5 else 0])
- > ])
- > ```
- >
- > When feeding proprioception to the model, apply the **inverse transformation** accordingly.
-
- ---
-
- ### 6️⃣ Reference Client Implementations
-
- Each released model includes a corresponding **reference client** under
- [`evaluation/<domain>/<robot>/client.py`](evaluation/) for reproducing exact deployment behaviors.
- We strongly recommend adapting from these clients when connecting to physical or simulated robots.
-
- ---
-
- ### 7️⃣ SLURM & Cluster Deployment
-
- For large-scale or distributed training/deployment (e.g., HPC clusters, AgiBot nodes):

- ```bash
- python -m deploy --model_path /path/to/your/model
  ```

- This script automatically detects SLURM environment variables, launches distributed servers, and writes connection metadata to `info.json`.
-
  ---

- ## ⚙️ Training / Fine-tuning on Custom Data
-
- X-VLA supports fine-tuning on new demonstrations via a modular and extensible dataset interface.
-
- ### Data Preparation Workflow
-
- 1. **Prepare Meta JSONs** — each domain has a `meta.json` listing trajectory file paths.
- 2. **Implement Custom Handler** — write a domain loader class with `iter_episode(traj_idx)` generator.
- 3. **Register Domain** — update:
-
-    * `datasets/domain_handler/registry.py`
-    * `datasets/domain_config.py`
-
- ### Example Handlers
-
- | Handler | Dataset | Description |
- | :--- | :--- | :--- |
- | `"lerobot"` | Agibot-Beta | Optimized for LEROBOT format |
- | `"h5py"` | RoboMind / Simulation | Efficient loading from `.h5` trajectories |
- | `"scattered"` | AGIWorld | Handles scattered trajectory storage |
-
- ---
-
- ### Launch Training with Accelerate

  ```bash
  accelerate launch \
      --mixed_precision bf16 \
      train.py \
-     --models '2toINF/X-VLA-Pt' \
      --train_metas_path /path/to/meta_files.json \
      --learning_rate 1e-4 \
      --learning_coef 0.1 \
@@ -275,47 +143,17 @@ accelerate launch \
      --warmup_steps 2000
  ```

- | Argument | Description |
- | :--- | :--- |
- | `--models` | Base model (e.g., `'2toINF/X-VLA-Pt'`) |
- | `--train_metas_path` | Path to meta JSON file(s) |
- | `--batch_size` | Batch size |
- | `--learning_rate` | Base LR |
- | `--learning_coef` | LR multiplier for soft prompts |
- | `--iters` | Total training iterations |
- | `--freeze_steps` | Steps to freeze backbone |
- | `--warmup_steps` | Warmup iterations |

  ---

-
- ## 📚 Citation
-
- If you use X-VLA in your research, please cite:

  ```bibtex
  @article{zheng2025x,
    title = {X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model},
-   author = {Zheng, Jinliang and Li, Jianxiong and Wang, Zhihao and Liu, Dongxiu and Kang, Xirui
-             and Feng, Yuchun and Zheng, Yinan and Zou, Jiayin and Chen, Yilun and Zeng, Jia and others},
    journal = {arXiv preprint arXiv:2510.10274},
    year = {2025}
  }
  ```
-
- ---
-
- ## 🪪 License
-
- This repository is licensed under the **Apache License 2.0**.
- You may freely use, modify, and distribute the code under the terms of the license.
-
- ```
- Copyright 2025 2toINF (https://github.com/2toinf)
- Licensed under the Apache License, Version 2.0.
- ```
-
- ---
-
- **Maintained by [2toINF](https://github.com/2toinf)**
- 💬 Feedback, issues, and contributions are welcome via GitHub Discussions or Pull Requests.
 
@@ -1,152 +1,85 @@
+ # X-VLA -- color_object Checkpoint

+ X-VLA: Soft-Prompted Transformer as a Scalable Cross-Embodiment Vision-Language-Action Model.
+ Paper: https://arxiv.org/pdf/2510.10274

+ ## Repository Structure

+ ```
+ checkpoints/
+   color_object/
+     ckpt-30000/
+       model.safetensors          # fine-tuned weights (step 30000)
+       config.json
+       tokenizer.json
+       tokenizer_config.json
+       vocab.json
+       merges.txt
+       preprocessor_config.json
+       special_tokens_map.json
+       state.json
+ models/                        # model architecture (Florence2 + X-VLA)
+   configuration_florence2.py
+   configuration_xvla.py
+   modeling_florence2.py
+   modeling_xvla.py
+   processing_xvla.py
+   action_hub.py
+   transformer.py
+ deploy/X-VLA-Pt/               # base pretrained model config & code
+ evaluation/                    # eval clients for Calvin, LIBERO, Simpler, etc.
+ slurm_scripts/                 # SLURM finetune scripts for all conflict splits
+ train.py                       # full training entry point
+ peft_train.py                  # LoRA / PEFT fine-tuning entry point
+ deploy.py                      # inference server launcher
+ requirements.txt
+ ```

  ---

+ ## Loading the color_object Checkpoint and Running Inference

+ ### 1. Install dependencies

  ```bash
+ git clone https://huggingface.co/yqi19/xvla
+ cd xvla
  pip install -r requirements.txt
  ```

+ ### 2. Download the checkpoint

+ The checkpoint is already in this repo at `checkpoints/color_object/ckpt-30000/`.
+ To download programmatically:

+ ```python
+ from huggingface_hub import snapshot_download
+ snapshot_download(repo_id="yqi19/xvla", local_dir="./xvla")
+ ```
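
After the snapshot finishes, it can help to confirm that the checkpoint directory actually holds the files listed under Repository Structure before passing it to `from_pretrained`. The following is a minimal sketch and not part of the repository: the helper name `missing_checkpoint_files` and the choice of which files count as required are assumptions made for illustration.

```python
from pathlib import Path

# Assumed minimal set, taken from the Repository Structure listing;
# adjust it if your checkpoint layout differs.
REQUIRED_FILES = ("model.safetensors", "config.json", "tokenizer.json")


def missing_checkpoint_files(ckpt_dir, required=REQUIRED_FILES):
    """Return the names in `required` that are absent from `ckpt_dir`."""
    root = Path(ckpt_dir)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    # Path assumes the snapshot was written to ./xvla as in the step above.
    ckpt = "./xvla/checkpoints/color_object/ckpt-30000"
    missing = missing_checkpoint_files(ckpt)
    if missing:
        print(f"Missing from {ckpt}: {', '.join(missing)}")
    else:
        print(f"All expected files found in {ckpt}")
```

An empty return value means every assumed file is present; it says nothing about whether the weights themselves are intact.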
 
 
 
 
 
 

+ ### 3. Launch the inference server

  ```python
  from transformers import AutoModel, AutoProcessor

+ model = AutoModel.from_pretrained(
+     "checkpoints/color_object/ckpt-30000",
+     trust_remote_code=True,
+ )
+ processor = AutoProcessor.from_pretrained(
+     "checkpoints/color_object/ckpt-30000",
+     trust_remote_code=True,
+ )

  model.run(processor, host="0.0.0.0", port=8000)
  ```

+ The inference endpoint will be available at:
  ```
+ POST http://localhost:8000/act
  ```

+ ### 4. Query the server (client side)

  ```python
  import requests
@@ -154,119 +87,54 @@ import numpy as np
  import json_numpy

  server_url = "http://localhost:8000/act"

+ proprio = np.zeros(7, dtype=np.float32)  # joint / EE state
+ image = np.zeros((256, 256, 3), dtype=np.uint8)  # RGB observation

  payload = {
      "proprio": json_numpy.dumps(proprio),
+     "language_instruction": "Pick up the red block and place it on the green object",
      "image0": json_numpy.dumps(image),
+     "domain_id": 0,  # domain id used during training
+     "steps": 10,  # diffusion denoising steps
  }

+ response = requests.post(server_url, json=payload, timeout=10)
+ actions = np.array(response.json()["action"], dtype=np.float32)
+ print(f"Predicted actions shape: {actions.shape}")  # e.g. (30, 20)
  ```
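
The added client indexes `response.json()["action"]` directly, so a malformed body surfaces as a `KeyError` or as a shape error later in the control loop. A hedged sketch of a stricter parser follows. `parse_action_response` is a hypothetical helper, it assumes the decoded body is a dict whose `action` field is a list of per-step lists, and the expected width of 20 is taken from the EE6D table in this README rather than from the server code.

```python
def parse_action_response(result, expected_dim=20):
    """Validate a decoded /act response and return the actions as lists of floats."""
    if "action" not in result:
        raise ValueError("response has no 'action' field")
    actions = result["action"]
    if not isinstance(actions, list) or not actions:
        raise ValueError("'action' must be a non-empty list of action steps")
    parsed = []
    for i, step in enumerate(actions):
        if not isinstance(step, list) or len(step) != expected_dim:
            raise ValueError(f"step {i} is not a list of {expected_dim} values")
        parsed.append([float(v) for v in step])
    return parsed


if __name__ == "__main__":
    # Stand-in for response.json(); a real client would pass the decoded body.
    fake = {"action": [[0.0] * 20 for _ in range(30)]}
    actions = parse_action_response(fake)
    print(len(actions), len(actions[0]))
```

Calling `response.raise_for_status()` before decoding, as the removed client did, would additionally surface HTTP-level failures.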

+ ### 5. Action format (EE6D)

+ | Component | Dims | Description |
+ |---|---|---|
+ | EE position | 3 | xyz translation |
+ | EE rotation | 6 | 6D rotation representation |
+ | Gripper | 1 | open/close binary |
+ | Padding | 10 | zeros (single-arm) |
+ | **Total** | **20** | per action step |
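
To make the layout in the table concrete, here is a small sketch that slices one 20-D action step into its named parts. The slice boundaries follow the table and the post-processing snippet in this README (`[:3]`, `[3:9]`, `[9]`); the function name `split_ee6d` and the dictionary keys are illustrative assumptions, not identifiers from the repository.

```python
def split_ee6d(action):
    """Split one 20-D action step into the components listed in the table."""
    if len(action) != 20:
        raise ValueError(f"expected 20 values, got {len(action)}")
    return {
        "position": list(action[0:3]),    # xyz translation
        "rotation6d": list(action[3:9]),  # 6D rotation representation
        "gripper": float(action[9]),      # raw value, thresholded downstream
        "padding": list(action[10:20]),   # zeros in the single-arm case
    }


if __name__ == "__main__":
    step = [0.1, 0.2, 0.3, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.8] + [0.0] * 10
    parts = split_ee6d(step)
    print(parts["position"], parts["gripper"])
```

The function works on any sequence that supports slicing, so a single row of the `(30, 20)` array returned by the client can be passed in directly.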
 

+ Post-processing rotation:
+ ```python
+ from datasets.utils import rotate6d_to_xyz
+ import numpy as np

+ action_final = np.concatenate([
+     action_pred[:3],
+     rotate6d_to_xyz(action_pred[3:9]),
+     np.array([1.0 if action_pred[9] > 0.5 else 0.0])
+ ])
  ```

  ---

+ ## Fine-tuning on Your Own Data

  ```bash
  accelerate launch \
      --mixed_precision bf16 \
      train.py \
+     --models checkpoints/color_object/ckpt-30000 \
      --train_metas_path /path/to/meta_files.json \
      --learning_rate 1e-4 \
      --learning_coef 0.1 \
@@ -275,47 +143,17 @@ accelerate launch \
      --warmup_steps 2000
  ```

+ See `finetune_readme.md` for the full data preparation guide.

  ---

+ ## Citation

  ```bibtex
  @article{zheng2025x,
    title = {X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model},
+   author = {Zheng, Jinliang and Li, Jianxiong and Wang, Zhihao and others},
    journal = {arXiv preprint arXiv:2510.10274},
    year = {2025}
  }
  ```