Abstract
VGT, a novel paradigm for visual generation tuning, enhances vision language models to achieve high-quality image reconstruction and generation with improved efficiency.
Large Vision Language Models (VLMs) effectively bridge the modality gap through extensive pretraining, acquiring sophisticated visual representations aligned with language. However, it remains underexplored whether these representations, optimized for multimodal understanding tasks, harbor an inherent potential for visual generation. In this paper, we propose VGT, Visual Generation Tuning, a novel paradigm designed to stimulate the latent visual generation capabilities of any vision language model. By performing efficient visual generation tuning on well-pretrained VLMs, we substantially reduce alignment costs and accelerate the convergence of autoregressive modeling in continuous space (20x speedup). Specifically, we discard the entangled pixel-level VAEs designed for diffusion transformers and instead formulate VGT-AE by aligning the semantic encoders of pretrained VLMs with the latent representations of pixel decoders. In image reconstruction, we achieve 26.67 PSNR and 0.50 rFID at a 28x compression ratio, outperforming specialized VAEs; in visual generation, we achieve state-of-the-art results among autoregressive models: 0.77 on GenEval and 78.73 on DPG-Bench. Furthermore, VGT shows strong scaling promise and can endow any VLM trained for multimodal understanding with visual generation capabilities, paving a new avenue toward next-generation unified multimodal foundation models. Models and code are available at https://github.com/hustvl/VGT.
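To make the VGT-AE idea above concrete, here is a minimal PyTorch sketch in which a frozen stand-in for a pretrained VLM vision encoder supplies the semantic latent grid, a lightweight alignment layer adapts it, and a small pixel decoder reconstructs the image. All module names, sizes, and the plain L1 reconstruction loss are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the VGT-AE idea from the abstract: reuse a pretrained VLM's
# semantic vision encoder as the latent space of an autoencoder and train a pixel
# decoder (plus a light alignment head) to reconstruct images from it.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PixelDecoder(nn.Module):
    """Toy convolutional decoder: semantic latent grid -> RGB image."""

    def __init__(self, latent_dim: int = 1024, channels: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(latent_dim, channels, 1),
            nn.GELU(),
            # Each ConvTranspose2d doubles spatial resolution (x16 total here).
            *[nn.Sequential(nn.ConvTranspose2d(channels, channels, 4, 2, 1), nn.GELU())
              for _ in range(4)],
            nn.Conv2d(channels, 3, 3, padding=1),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)


class VGTAE(nn.Module):
    """Frozen semantic encoder (from a pretrained VLM) + alignment head + pixel decoder."""

    def __init__(self, semantic_encoder: nn.Module, latent_dim: int = 1024):
        super().__init__()
        self.encoder = semantic_encoder.eval()
        for p in self.encoder.parameters():   # keep the VLM's semantics intact
            p.requires_grad_(False)
        self.align = nn.Conv2d(latent_dim, latent_dim, 1)  # lightweight latent alignment
        self.decoder = PixelDecoder(latent_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            z = self.encoder(images)           # (B, latent_dim, H/16, W/16) semantic grid
        return self.decoder(self.align(z))     # reconstructed RGB


# Stand-in for a real VLM vision tower (e.g. a ViT returning a patch-feature grid).
toy_encoder = nn.Sequential(nn.Conv2d(3, 1024, kernel_size=16, stride=16))
model = VGTAE(toy_encoder)

images = torch.rand(2, 3, 256, 256)
recon = model(images)
loss = F.l1_loss(recon, images)   # assumed objective; real training likely adds perceptual/GAN terms
loss.backward()
print(recon.shape, float(loss))
```

In practice the encoder would be the vision tower of an actual pretrained VLM, and only the alignment head and pixel decoder would be trained for reconstruction.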
Community
VGT: Visual Generation Tuning
Unleashing Visual Generation Capabilities from Any Pretrained VLM
GenEval 0.83 | DPG-Bench 81.28 | 20× Faster Convergence
arXiv: https://arxiv.org/abs/2511.23469
GitHub: https://github.com/hustvl/VGT
The Core Problem
Large VLMs acquire rich, language-aligned visual representations through pretraining for multimodal understanding, yet whether these representations can also drive visual generation has remained underexplored. Existing generation pipelines instead build on pixel-level VAEs designed for diffusion transformers, incurring high alignment costs and slow convergence.
Our Solution
VGT-AE aligns the semantic encoder of a pretrained VLM with the latent representations of a pixel decoder, replacing the entangled pixel-level VAE. A short visual generation tuning stage then teaches the VLM to autoregressively model images in this continuous latent space, as sketched below.
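The following is a minimal, self-contained sketch of that tuning step, assuming a toy causal transformer stands in for the VLM backbone and the next continuous latent is predicted with a simple MSE regression head. The model, sizes, and objective here are illustrative placeholders, not the paper's exact training recipe.

```python
# Hedged sketch of visual generation tuning: a (stand-in) VLM backbone is tuned to
# autoregressively predict continuous visual latents (the VGT-AE tokens) conditioned
# on text, using teacher forcing and an assumed next-latent regression loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, latent_dim, n_text, n_vis = 512, 1024, 16, 64

# Stand-ins for the VLM: text embeddings + a causal transformer backbone.
text_embed = nn.Embedding(32000, d_model)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2
)
vis_in = nn.Linear(latent_dim, d_model)    # continuous latent -> backbone width
vis_out = nn.Linear(d_model, latent_dim)   # backbone width -> predicted next latent

text_ids = torch.randint(0, 32000, (2, n_text))
vis_latents = torch.randn(2, n_vis, latent_dim)   # would come from the VGT-AE encoder

# Teacher forcing: feed text + all-but-last visual latents, predict each next latent.
seq = torch.cat([text_embed(text_ids), vis_in(vis_latents[:, :-1])], dim=1)
causal = torch.triu(torch.full((seq.size(1), seq.size(1)), float("-inf")), diagonal=1)
hidden = backbone(seq, mask=causal)

pred = vis_out(hidden[:, n_text - 1:])   # positions whose next token is a visual latent
loss = F.mse_loss(pred, vis_latents)     # assumed objective, for illustration only
loss.backward()
print(pred.shape, float(loss))
```

At inference, the tuned backbone would generate the continuous latents one by one from a text prompt, and the VGT-AE pixel decoder would map them back to an image.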
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction (2025)
- DINO-Tok: Adapting DINO for Visual Tokenizers (2025)
- One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation (2025)
- Adapting Self-Supervised Representations as a Latent Space for Efficient Generation (2025)
- Temporal-Visual Semantic Alignment: A Unified Architecture for Transferring Spatial Priors from Vision Models to Zero-Shot Temporal Tasks (2025)
- TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models (2025)
- Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models (2025)

