Update README.md
README.md CHANGED
@@ -13,7 +13,7 @@ pipeline_tag: text-to-image
 ---
 
 <p align="center">
-<img src="https://raw.githubusercontent.com/NVlabs/PixelDiT/master/assets/pixeldit-logo.png" height="
 </p>
 
 <h2 align="center">PixelDiT: Pixel Diffusion Transformers for Image Generation</h2>
@@ -40,26 +40,13 @@ pipeline_tag: text-to-image
 <a href="https://github.com/NVlabs/PixelDiT"><img src="https://img.shields.io/badge/GitHub-Code-blue" /></a>
 </p>
 
-## Model Overview
-
-**PixelDiT-T2I** (1.3B parameters) is a text-to-image generation model that operates directly in **pixel space** — no VAE, no latent space. It uses a dual-level architecture combining a patch-level DiT for global semantics with a pixel-level DiT for fine texture details, and MM-DiT blocks for text-image fusion.
-
-This checkpoint supports generation at **512x512** and **1024x1024** resolution with multi-aspect-ratio support.
-
 ### Key Features
 
-- **VAE-free**
 - **Dual-level architecture**: Patch-level DiT + Pixel-level DiT
 - **MM-DiT text-image fusion**: Joint attention between text and image tokens
 - **Text encoder**: Gemma-2-2B-IT
-- **Multi-aspect-ratio**: Supports various aspect ratios at
-
-## Performance
-
-| Resolution | GenEval | DPG-Bench |
-|:---:|:---:|:---:|
-| 512x512 | 0.78 | 83.7 |
-| 1024x1024 | 0.74 | 83.5 |
 
 ## Usage
 
@@ -111,18 +98,10 @@ python inference.py \
 | Text max length | 300 |
 | Text encoder | Gemma-2-2B-IT |
 
-## Training
-
-The model was trained in three stages:
-
-1. **Stage 1** — Pre-train at 512x512, fixed resolution, with REPA loss
-2. **Stage 2** — Fine-tune at 512x512 with multi-aspect-ratio, no REPA loss
-3. **Stage 3** — Fine-tune at 1024x1024 with multi-aspect-ratio, no REPA loss
-
 ## Citation
 
 ```bibtex
-@inproceedings{
 title={PixelDiT: Pixel Diffusion Transformers for Image Generation},
 author={Yongsheng Yu and Wei Xiong and Weili Nie and Yichen Sheng and Shiqiu Liu and Jiebo Luo},
 booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
@@ -13,7 +13,7 @@ pipeline_tag: text-to-image
 ---
 
 <p align="center">
+<img src="https://raw.githubusercontent.com/NVlabs/PixelDiT/master/assets/pixeldit-logo.png" height="60" />
 </p>
 
 <h2 align="center">PixelDiT: Pixel Diffusion Transformers for Image Generation</h2>
@@ -40,26 +40,13 @@ pipeline_tag: text-to-image
 <a href="https://github.com/NVlabs/PixelDiT"><img src="https://img.shields.io/badge/GitHub-Code-blue" /></a>
 </p>
 
 ### Key Features
 
+- **VAE-free**
 - **Dual-level architecture**: Patch-level DiT + Pixel-level DiT
 - **MM-DiT text-image fusion**: Joint attention between text and image tokens
 - **Text encoder**: Gemma-2-2B-IT
+- **Multi-aspect-ratio**: Supports various aspect ratios at 1024px
 
 ## Usage
 
@@ -111,18 +98,10 @@ python inference.py \
 | Text max length | 300 |
 | Text encoder | Gemma-2-2B-IT |
 
 ## Citation
 
 ```bibtex
+@inproceedings{yu2026pixeldit,
 title={PixelDiT: Pixel Diffusion Transformers for Image Generation},
 author={Yongsheng Yu and Wei Xiong and Weili Nie and Yichen Sheng and Shiqiu Liu and Jiebo Luo},
 booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},