ColModernVBERT
Model
This is the model card for ColModernVBERT, the late-interaction version of ModernVBERT. It is fine-tuned for visual document retrieval and is our most performant model on this task.
Overview
The ModernVBERT suite is a set of compact 250M-parameter vision-language encoders. They achieve state-of-the-art performance in this size class and match the performance of models up to 10x larger.
For more information about ModernVBERT, please see the arXiv preprint (https://arxiv.org/abs/2510.01149).
Models
- ColModernVBERT is the late-interaction version, fine-tuned for visual document retrieval tasks. It is our most performant model on this task.
- BiModernVBERT is the bi-encoder version, fine-tuned for visual document retrieval tasks.
- ModernVBERT-embed is the bi-encoder version after modality alignment (using an MLM objective) and contrastive learning, without document specialization.
- ModernVBERT is the base model after modality alignment (using an MLM objective).
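To make the difference between the late-interaction and bi-encoder variants concrete, here is a minimal sketch in plain Python. It assumes ColBERT-style MaxSim scoring, which is what "late interaction" usually refers to; the card does not spell out the scoring function, and the toy 2-D vectors and function names are purely illustrative.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Late interaction (assumed ColBERT-style MaxSim): for each query token
    vector, keep its best-matching document vector, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def bi_encoder_score(query_vec, doc_vec):
    """Bi-encoder: one vector per input, compared with a single dot product."""
    return dot(query_vec, doc_vec)


# Toy embeddings (made up): one vector per query token / document patch.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b = [[0.5, 0.5], [0.5, 0.5]]

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
single_vector_score = bi_encoder_score([1.0, 1.0], [0.5, 0.5])

print(score_a, score_b, single_vector_score)
```

In this toy case doc_a scores higher because each query vector finds an exact match among its patches, which is the kind of fine-grained matching a single pooled vector cannot express.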
Usage
🏎️ If your GPU supports it, we recommend using ModernVBERT with Flash Attention 2 to achieve the highest GPU throughput. To do so, install Flash Attention 2 (the flash-attn package on PyPI), then use the model as normal.
For now, ColModernVBERT support is not yet merged into the official colpali repository, so you need to clone the repository and check out the corresponding branch to use it.
pip install colpali-engine
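The choice between Flash Attention 2 and the default attention path can be made at load time. The sketch below is one possible way to do it, not the authors' method: the helper name pick_load_options and the dtype_name key are hypothetical, and it is an assumption that ColModernVBert.from_pretrained accepts the attn_implementation keyword that Transformers models generally accept.

```python
import importlib.util


def pick_load_options(has_gpu: bool) -> dict:
    """Hypothetical helper: choose attention backend and dtype for loading.

    Flash Attention 2 is only selected when a GPU is present and the
    flash_attn package is importable; otherwise fall back to float32
    with PyTorch's scaled-dot-product attention ("sdpa").
    """
    flash_available = importlib.util.find_spec("flash_attn") is not None
    if has_gpu and flash_available:
        # bfloat16 follows the card's hint for flash attention.
        return {"attn_implementation": "flash_attention_2", "dtype_name": "bfloat16"}
    return {"attn_implementation": "sdpa", "dtype_name": "float32"}


options = pick_load_options(has_gpu=False)
print(options)
```

With PyTorch installed, has_gpu would typically come from torch.cuda.is_available(), and dtype_name can be mapped to a dtype with getattr(torch, options["dtype_name"]).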
Here is an example of computing query-document similarity scores with ColModernVBERT:
import torch
from colpali_engine.models import ColModernVBert, ColModernVBertProcessor
from PIL import Image
from huggingface_hub import hf_hub_download
model_id = "ModernVBERT/colmodernvbert"
processor = ColModernVBertProcessor.from_pretrained(model_id)
model = ColModernVBert.from_pretrained(
    model_id,
    torch_dtype=torch.float32,  # use torch_dtype=torch.bfloat16 for flash attention
    trust_remote_code=True,
)
image = Image.open(hf_hub_download("HuggingFaceTB/SmolVLM", "example_images/rococo.jpg", repo_type="space"))
text = "This is a text"
# Prepare inputs
text_inputs = processor.process_texts([text])
image_inputs = processor.process_images([image])
# Inference (gradients are not needed for retrieval)
with torch.no_grad():
    q_embeddings = model(**text_inputs)
    corpus_embeddings = model(**image_inputs)
# Get the similarity scores
scores = processor.score(q_embeddings, corpus_embeddings)
print("Similarity scores:", scores)
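With more than one document image, the scores are typically used to rank the corpus for each query. The following sketch assumes the scores form a matrix with one row per query and one column per document (for a PyTorch tensor, scores.tolist() yields this nested-list form); the helper rank_documents and the example values are hypothetical.

```python
def rank_documents(scores, top_k=3):
    """For each query row, return up to top_k (doc_index, score) pairs,
    sorted from highest to lowest score."""
    rankings = []
    for row in scores:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        rankings.append([(j, row[j]) for j in order[:top_k]])
    return rankings


# Made-up scores for 2 queries against 3 documents.
example_scores = [
    [12.5, 30.1, 7.8],
    [4.0, 3.5, 9.2],
]

rankings = rank_documents(example_scores, top_k=2)
for query_index, ranked in enumerate(rankings):
    print(f"query {query_index}: {ranked}")
```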
Evaluation
License
We release the ModernVBERT model architectures, model weights, and training codebase under the MIT license.
Citation
If you use ModernVBERT in your work, please cite:
@misc{teiletche2025modernvbertsmallervisualdocument,
      title={ModernVBERT: Towards Smaller Visual Document Retrievers},
      author={Paul Teiletche and Quentin Macé and Max Conti and Antonio Loison and Gautier Viaud and Pierre Colombo and Manuel Faysse},
      year={2025},
      eprint={2510.01149},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2510.01149},
}