yongshengy committed on
Commit 7c63b99 · verified · 1 Parent(s): 47ac20c

Update README.md

Files changed (1)
  1. README.md +4 -25
README.md CHANGED
@@ -13,7 +13,7 @@ pipeline_tag: text-to-image
  ---
 
  <p align="center">
- <img src="https://raw.githubusercontent.com/NVlabs/PixelDiT/master/assets/pixeldit-logo.png" height="120" />
+ <img src="https://raw.githubusercontent.com/NVlabs/PixelDiT/master/assets/pixeldit-logo.png" height="60" />
  </p>
 
  <h2 align="center">PixelDiT: Pixel Diffusion Transformers for Image Generation</h2>
@@ -40,26 +40,13 @@ pipeline_tag: text-to-image
  <a href="https://github.com/NVlabs/PixelDiT"><img src="https://img.shields.io/badge/GitHub-Code-blue" /></a>
  </p>
 
- ## Model Overview
-
- **PixelDiT-T2I** (1.3B parameters) is a text-to-image generation model that operates directly in **pixel space** — no VAE, no latent space. It uses a dual-level architecture combining a patch-level DiT for global semantics with a pixel-level DiT for fine texture details, and MM-DiT blocks for text-image fusion.
-
- This checkpoint supports generation at **512x512** and **1024x1024** resolution with multi-aspect-ratio support.
-
  ### Key Features
 
- - **VAE-free**: Generates images directly in pixel space, eliminating VAE-induced artifacts
+ - **VAE-free**
  - **Dual-level architecture**: Patch-level DiT + Pixel-level DiT
  - **MM-DiT text-image fusion**: Joint attention between text and image tokens
  - **Text encoder**: Gemma-2-2B-IT
- - **Multi-aspect-ratio**: Supports various aspect ratios at 512px and 1024px
-
- ## Performance
-
- | Resolution | GenEval | DPG-Bench |
- |:---:|:---:|:---:|
- | 512x512 | 0.78 | 83.7 |
- | 1024x1024 | 0.74 | 83.5 |
+ - **Multi-aspect-ratio**: Supports various aspect ratios at 1024px
 
  ## Usage
 
@@ -111,18 +98,10 @@ python inference.py \
  | Text max length | 300 |
  | Text encoder | Gemma-2-2B-IT |
 
- ## Training
-
- The model was trained in three stages:
-
- 1. **Stage 1** — Pre-train at 512x512, fixed resolution, with REPA loss
- 2. **Stage 2** — Fine-tune at 512x512 with multi-aspect-ratio, no REPA loss
- 3. **Stage 3** — Fine-tune at 1024x1024 with multi-aspect-ratio, no REPA loss
-
  ## Citation
 
  ```bibtex
- @inproceedings{yu2025pixeldit,
+ @inproceedings{yu2026pixeldit,
  title={PixelDiT: Pixel Diffusion Transformers for Image Generation},
  author={Yongsheng Yu and Wei Xiong and Weili Nie and Yichen Sheng and Shiqiu Liu and Jiebo Luo},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},