nvidia/PixelDiT-1300M-1024px

image-generation

yongshengy committed on Apr 15

Commit 7c63b99 · verified · 1 Parent(s): 47ac20c

Update README.md

Files changed (1)

README.md +4 -25

@@ -13,7 +13,7 @@ pipeline_tag: text-to-image
 ---
 <p align="center">
-  <img src="https://raw.githubusercontent.com/NVlabs/PixelDiT/master/assets/pixeldit-logo.png" height="120" />
 </p>
 <h2 align="center">PixelDiT: Pixel Diffusion Transformers for Image Generation</h2>
@@ -40,26 +40,13 @@ pipeline_tag: text-to-image
   <a href="https://github.com/NVlabs/PixelDiT"><img src="https://img.shields.io/badge/GitHub-Code-blue" /></a>
 </p>
-## Model Overview
-**PixelDiT-T2I** (1.3B parameters) is a text-to-image generation model that operates directly in **pixel space** — no VAE, no latent space. It uses a dual-level architecture combining a patch-level DiT for global semantics with a pixel-level DiT for fine texture details, and MM-DiT blocks for text-image fusion.
-This checkpoint supports generation at **512x512** and **1024x1024** resolution with multi-aspect-ratio support.
 ### Key Features
-- **VAE-free**: Generates images directly in pixel space, eliminating VAE-induced artifacts
 - **Dual-level architecture**: Patch-level DiT + Pixel-level DiT
 - **MM-DiT text-image fusion**: Joint attention between text and image tokens
 - **Text encoder**: Gemma-2-2B-IT
-- **Multi-aspect-ratio**: Supports various aspect ratios at 512px and 1024px
-## Performance
-| Resolution | GenEval | DPG-Bench |
-|:---:|:---:|:---:|
-| 512x512 | 0.78 | 83.7 |
-| 1024x1024 | 0.74 | 83.5 |
 ## Usage
@@ -111,18 +98,10 @@ python inference.py \
 | Text max length | 300 |
 | Text encoder | Gemma-2-2B-IT |
-## Training
-The model was trained in three stages:
-1. **Stage 1** — Pre-train at 512x512, fixed resolution, with REPA loss
-2. **Stage 2** — Fine-tune at 512x512 with multi-aspect-ratio, no REPA loss
-3. **Stage 3** — Fine-tune at 1024x1024 with multi-aspect-ratio, no REPA loss
 ## Citation
 ```bibtex
-@inproceedings{yu2025pixeldit,
       title={PixelDiT: Pixel Diffusion Transformers for Image Generation},
       author={Yongsheng Yu and Wei Xiong and Weili Nie and Yichen Sheng and Shiqiu Liu and Jiebo Luo},
       booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},

 ---
 <p align="center">
+  <img src="https://raw.githubusercontent.com/NVlabs/PixelDiT/master/assets/pixeldit-logo.png" height="60" />
 </p>
 <h2 align="center">PixelDiT: Pixel Diffusion Transformers for Image Generation</h2>
   <a href="https://github.com/NVlabs/PixelDiT"><img src="https://img.shields.io/badge/GitHub-Code-blue" /></a>
 </p>
 ### Key Features
+- **VAE-free**
 - **Dual-level architecture**: Patch-level DiT + Pixel-level DiT
 - **MM-DiT text-image fusion**: Joint attention between text and image tokens
 - **Text encoder**: Gemma-2-2B-IT
+- **Multi-aspect-ratio**: Supports various aspect ratios at 1024px
 ## Usage
 | Text max length | 300 |
 | Text encoder | Gemma-2-2B-IT |
 ## Citation
 ```bibtex
+@inproceedings{yu2026pixeldit,
       title={PixelDiT: Pixel Diffusion Transformers for Image Generation},
       author={Yongsheng Yu and Wei Xiong and Weili Nie and Yichen Sheng and Shiqiu Liu and Jiebo Luo},
       booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},