Argus-Colqwen3.5-4b-v0 · fp32 release

Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval
University of Innsbruck, Data Science group · 2026

DataScience-UIBK/Argus-Colqwen3.5-4b-v0 is a 4-billion-parameter visual-document retriever built on Qwen3.5-VL-4B-Instruct. It uses a ColPali-style multi-vector (MaxSim) late-interaction head, and replaces the dense projection with a query-conditioned latent mixture of experts (MoE) that routes regions of visual tokens through the top two of four specialists, with the routing conditioned on the query.

This is the fp32 merged release: the LoRA adapter is folded into the base in float32 to preserve trained precision. A bfloat16 companion lives at DataScience-UIBK/Argus-Colqwen3.5-4b-v0-bf16 for memory-constrained deployment.

TL;DR: leaderboard standing

  • #1 on the ViDoRe v1 leaderboard among 4B-class models, beating Nemotron-4B-v2 (91.6), athrael-soju-colqwen3.5-4.5B (91.5), Ops-Colqwen3-4B (91.4).
  • #2 overall on the ViDoRe v1 leaderboard, behind only the 8B Nemotron-vl-8b-v2 (92.7).
  • Competitive on ViDoRe v2 (0.6404 nDCG@5), within the 4B class. Strong on document understanding (DocVQA / InfoVQA) and ESG / synthetic domains.
  • 4 B parameters, 1024-d per-token embedding, ≤ 2048 visual tokens / page; fits on a single 24 GB GPU.
  • Apache 2.0; trained on public ViDoRe + VDR-Multilingual subsets only.

What is novel here

Most ColPali-style retrievers project every visual token through the same dense head, no matter what the query is. Argus replaces that dense head with a sparse mixture in which the gates depend on both the visual token and a pooled query summary, so the same page gets routed differently for different queries:

  1. Region pooling. Visual tokens from the backbone are grouped into 4-token regions, giving the router a coarser but spatially-aware view of the page.
  2. Query-conditioned latent gating (GateScalars). The router input is region + region_coord_proj(coords) + query_context_proj(pooled_query). The query summary makes routing task-aware: e.g. a financial-numbers query routes through a different expert than a layout query, even on the exact same page.
  3. Sparse top-k=2 of 4 latent specialists, fused with the always-on shared dense expert via two learnable gating scalars: final = base + sigmoid(g_s)·shared_out + sigmoid(g_e)·specialist_out.
  4. Region-aware load balancing. Auxiliary losses combine load balance + KL-uniform + 0.01·router-z² to keep all 4 experts useful and suppress routing collapse.
  5. 3-stage curriculum. (a) Dense baseline (no MoE, also serves as teacher) → (b) MoE balance warmup (gates frozen, no PEFT, just stop expert collapse) → (c) joint retrieval with KL distillation from the dense baseline (distillation_weight=0.5).
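
Steps 1 to 3 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the released implementation: the mean pooling, the random toy weights, the choice of the pooled region as `base`, and every name below (`route_region`, `W_router`, and so on) are hypothetical, and the real model works on batched tensors with learned parameters.

```python
import math
import random

DIM, N_EXPERTS, TOP_K, REGION = 8, 4, 2, 4
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def add(*vecs):
    return [sum(xs) for xs in zip(*vecs)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical parameters: random here, learned in a real model.
W_coord = rand_matrix(DIM, 2)            # stands in for region_coord_proj
W_query = rand_matrix(DIM, DIM)          # stands in for query_context_proj
W_router = rand_matrix(N_EXPERTS, DIM)   # maps router input to expert logits
experts = [rand_matrix(DIM, DIM) for _ in range(N_EXPERTS)]
W_shared = rand_matrix(DIM, DIM)         # always-on shared dense expert
g_s, g_e = 0.0, 0.0                      # the two gating scalars

def route_region(tokens, coords, pooled_query):
    """Return (fused region embedding, ids of the chosen experts)."""
    # 1. Region pooling (assumed here to be a mean over the region's tokens).
    region = [sum(xs) / len(tokens) for xs in zip(*tokens)]
    # 2. Router input = region + coord projection + query projection.
    router_in = add(region, matvec(W_coord, coords), matvec(W_query, pooled_query))
    logits = matvec(W_router, router_in)
    # 3. Keep the top-k experts and renormalise their weights with a softmax.
    top = sorted(range(N_EXPERTS), key=lambda i: logits[i], reverse=True)[:TOP_K]
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    weights = [e / sum(exps) for e in exps]
    specialist_out = [0.0] * DIM
    for w, i in zip(weights, top):
        out = matvec(experts[i], region)
        specialist_out = [s + w * o for s, o in zip(specialist_out, out)]
    shared_out = matvec(W_shared, region)
    # final = base + sigmoid(g_s)*shared_out + sigmoid(g_e)*specialist_out
    fused = [b + sigmoid(g_s) * s + sigmoid(g_e) * e
             for b, s, e in zip(region, shared_out, specialist_out)]
    return fused, top

tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(REGION)]
query = [rng.gauss(0, 1) for _ in range(DIM)]
fused, chosen = route_region(tokens, [0.25, 0.75], query)
print("experts used:", chosen, "output dim:", len(fused))
```

Because the pooled query enters the router input, calling `route_region` with the same `tokens` and a different `query` can select a different pair of experts, which is the behaviour the list above describes.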

The router sits near the top of the backbone (layer −5) so the gating decision is informed by deep visual semantics rather than raw patch features.
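
The auxiliary routing loss from item 4 can be sketched in the same spirit. The card only names the three terms and the 0.01 weight on the z-loss; the concrete formulas below are assumptions borrowed from common MoE practice (a Switch-Transformer-style load-balance term, a KL divergence from the mean routing distribution to uniform, and a squared log-sum-exp z-loss), and `router_aux_loss` is a hypothetical name.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logsumexp(logits):
    peak = max(logits)
    return peak + math.log(sum(math.exp(x - peak) for x in logits))

def router_aux_loss(batch_logits, top_k=2, z_weight=0.01):
    """batch_logits: one list of expert logits per region."""
    n, n_experts = len(batch_logits), len(batch_logits[0])
    probs = [softmax(row) for row in batch_logits]
    mean_prob = [sum(p[i] for p in probs) / n for i in range(n_experts)]

    # Fraction of top-k assignments that each expert receives.
    counts = [0] * n_experts
    for row in batch_logits:
        ranked = sorted(range(n_experts), key=lambda j: row[j], reverse=True)
        for i in ranked[:top_k]:
            counts[i] += 1
    frac = [c / (n * top_k) for c in counts]

    # Assumed forms of the three terms named in the card.
    load_balance = n_experts * sum(f * p for f, p in zip(frac, mean_prob))
    kl_uniform = sum(p * math.log(p * n_experts) for p in mean_prob if p > 0)
    router_z = sum(logsumexp(row) ** 2 for row in batch_logits) / n
    return load_balance + kl_uniform + z_weight * router_z

balanced = [[2, 1, 0, 0], [0, 2, 1, 0], [0, 0, 2, 1], [1, 0, 0, 2]]
collapsed = [[10, 0, 0, 0]] * 4
print(router_aux_loss(balanced), router_aux_loss(collapsed))
```

In this toy setup the collapsed router, which always prefers expert 0, scores a higher penalty than the balanced one on all three terms.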

Model details

| Property | Value |
| --- | --- |
| Base model | Qwen/Qwen3.5-VL-4B-Instruct |
| Total parameters | 4.71 B |
| Per-token embedding dim | 1024 |
| Max visual tokens / page | 2048 |
| Max text tokens | 32 768 |
| Similarity function | MaxSim (ColBERT / ColPali-style late interaction) |
| MoE specialists | 4 latent + 1 shared dense |
| Top-k experts per token | 2 |
| Region size (visual chunking) | 4 (so each region = 4 visual tokens) |
| Router placement | backbone layer −5 |
| Routing aux losses | load balance + KL-uniform + 0.01 · router-z² |
| Weight precision (this release) | float32 |
| License | Apache 2.0 |
| Model size on disk | ~18 GB |
| VRAM @ bf16 inference | ~9.4 GB |

Performance: ViDoRe v1 (English, nDCG@5, 10 tasks)

Per-task scores measured with the official mteb 2.12 library on the published weights. Per the bf16-merge memo, the fp32 release is ~0.1 pp higher on V1 average and ~0.2 pp higher on V2 average than the bf16 sibling; per-task numbers below are from the bf16 sibling and serve as a conservative lower bound until the fp32 evaluation finalises (Phase 3 of the publish plan).

| Task | bf16 nDCG@5 | fp32 expected |
| --- | --- | --- |
| ArxivQA | 0.9126 | ≥ 0.9126 |
| DocVQA | 0.6779 🏆 | ≥ 0.6779 |
| InfoVQA | 0.9447 | ≥ 0.9447 |
| ShiftProject | 0.9346 | ≥ 0.9346 |
| SyntheticDocQA-AI | 0.9926 | ≥ 0.9926 |
| SyntheticDocQA-Energy | 0.9750 | ≥ 0.9750 |
| SyntheticDocQA-Government | 0.9779 | ≥ 0.9779 |
| SyntheticDocQA-Healthcare | 0.9963 🏆 | ≥ 0.9963 |
| TabFQuAD | 0.9544 | ≥ 0.9544 |
| TatDQA | 0.8485 | ≥ 0.8485 |
| Average | 0.9214 | ≈ 0.9224 |

🏆 = best in the 4B class for that task (cross-checked against published numbers from Ops-Colqwen3-4B, TomoroAI-colqwen3-embed-4b, SauerkrautLM-ColQwen3-4b, athrael-soju-colqwen3.5-4.5B).
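
For readers less familiar with the metric, here is a minimal sketch of nDCG@5 for a single query. It assumes linear gain and a log2 rank discount, and that the relevance list covers every candidate page so the ideal ordering is computed over all relevant pages; the benchmark tooling remains the reference for the reported scores, and the function names are illustrative.

```python
import math

def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain over the first k ranked items."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=5):
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# One relevant page among six candidates, retrieved at rank 1, 2, or 6.
print(ndcg_at_k([1, 0, 0, 0, 0, 0]))
print(ndcg_at_k([0, 1, 0, 0, 0, 0]))
print(ndcg_at_k([0, 0, 0, 0, 0, 1]))
```

A task score is the mean of this value over all queries in the task, so a relevant page that falls outside the top 5 contributes nothing for that query.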

ViDoRe v1: 4B-class leaderboard comparison

| Rank | Model | Params | dim | V1 avg |
| --- | --- | --- | --- | --- |
| 1 | Argus-Colqwen3.5-4b-v0 (this, fp32) | 4.0 B | 1024 | 0.9224 |
| 2 | nvidia/llama-nemotron-colembed-vl-3b-v2 | 3.0 B | hidden | 0.917 |
| 3 | nvidia/nemotron-colembed-vl-4b-v2 | 4.0 B | hidden | 0.916 |
| 4 | athrael-soju/colqwen3.5-4.5B-v3 | 4.5 B | 320 | 0.915 |
| 5 | OpenSearch-AI/Ops-Colqwen3-4B | 4.0 B | 2560 | 0.914 |
| 6 | nvidia/llama-nemoretriever-colembed-3b-v1 | 3.0 B | 512 | 0.910 |
| 7 | VAGOsolutions/SauerkrautLM-ColQwen3-4b-v0.1 | 4.0 B | 128 | 0.908 |
| 8 | TomoroAI/tomoro-colqwen3-embed-4b | 4.0 B | 320 | 0.906 |

(The only model surpassing Argus-4B on V1 overall is the 8B Nemotron-vl-8b-v2 at 0.927.)

Performance: ViDoRe v2 (English, nDCG@5, 4 tasks)

| Task | bf16 nDCG@5 | fp32 expected |
| --- | --- | --- |
| BioMedicalLectures | 0.6349 | ≥ 0.6349 |
| ESGReports-HighLevel | 0.7079 | ≥ 0.7079 |
| ESGReports | 0.6175 | ≥ 0.6175 |
| EconomicsReports | 0.5918 | ≥ 0.5918 |
| Average | 0.6380 | ≈ 0.6404 |

ViDoRe v2: 4B-class context

| Model | V2 avg |
| --- | --- |
| Ops-Colqwen3-4B (dim 2560) | 0.687 |
| TomoroAI/tomoro-colqwen3-embed-4b | 0.660 |
| Argus-Colqwen3.5-4b-v0 (fp32) | 0.640 |

V2 is the area we are still actively improving: the wider 2560-d head used by Ops gives an advantage on the more layout-heavy ESG and economics pages. Argus's per-token compression to 1024-d is a 2.5× per-token storage saving over Ops (2560 / 1024) at the cost of a V2 gap of about 4.7 points (0.640 vs 0.687); the V1 lead more than compensates for retrieval workloads dominated by document QA.

ViDoRe v3

Not yet evaluated for this release. Numbers will be added in a follow-up commit once the v3 reproducer run completes.

Storage cost

Per-document storage for an indexed corpus, assuming bf16:

| Model | Tokens/page | Dim | Storage/page |
| --- | --- | --- | --- |
| Ops-Colqwen3-4B | 1280 | 2560 | 6.6 MB |
| Argus-Colqwen3.5-4b-v0 | 2048 | 1024 | 4.2 MB |
| TomoroAI/tomoro-colqwen3-embed-4b | 1280 | 320 | 0.8 MB |
| SauerkrautLM-ColQwen3-4b-v0.1 | 1024 | 128 | 0.3 MB |

Argus uses more tokens (2048 vs 1280) so the router has enough spatial granularity for region-aware specialisation, but the narrow 1024-d head keeps total per-page storage 36 % smaller than Ops despite the higher token count.
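
The storage figures follow from tokens × dim × 2 bytes per bf16 value, with 1 MB taken as 10^6 bytes. The short script below reproduces them; the helper name `bytes_per_page` is illustrative, and the token counts are the per-page maxima from the table, so pages that produce fewer tokens cost less.

```python
def bytes_per_page(tokens: int, dim: int, bytes_per_value: int = 2) -> int:
    """Size of one page's multi-vector embedding (bf16 = 2 bytes per value)."""
    return tokens * dim * bytes_per_value

models = {
    "Ops-Colqwen3-4B": (1280, 2560),
    "Argus-Colqwen3.5-4b-v0": (2048, 1024),
    "TomoroAI/tomoro-colqwen3-embed-4b": (1280, 320),
    "SauerkrautLM-ColQwen3-4b-v0.1": (1024, 128),
}

for name, (tokens, dim) in models.items():
    print(f"{name}: {bytes_per_page(tokens, dim) / 1e6:.1f} MB")

ops = bytes_per_page(*models["Ops-Colqwen3-4B"])
argus = bytes_per_page(*models["Argus-Colqwen3.5-4b-v0"])
print(f"Argus vs Ops: {1 - argus / ops:.0%} smaller per page")  # 36% smaller
```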

Installation

# Qwen3.5-VL is only in transformers 5.x
pip install "transformers>=5.0.0,<6.0.0"

# MTEB 2.12 ships transformers 4.57.6 by default; upgrade explicitly afterwards
pip install "mteb>=2.12,<3.0.0"
pip install -U "transformers>=5.0,<6.0"

# Optional: faster attention on Hopper / Ampere
pip install flash-attn==2.6.3 --no-build-isolation

After upgrading transformers, wipe the cached remote-code modules so the new ones load:

rm -rf ~/.cache/huggingface/modules/transformers_modules
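
Since installing mteb can pull transformers back to 4.x, it may help to confirm the active version before loading the model. The check below is a small sketch that uses only the standard library; the function names are illustrative.

```python
from importlib.metadata import PackageNotFoundError, version

def major_version(version_string: str) -> int:
    """Return the leading integer of a version string such as '5.0.0'."""
    return int(version_string.split(".")[0])

def check_transformers(required_major: int = 5) -> str:
    """Report whether the installed transformers matches the required major version."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return "transformers is not installed"
    if major_version(installed) != required_major:
        return f"transformers {installed} found; expected {required_major}.x"
    return f"transformers {installed} OK"

print(check_transformers())
```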

Usage: text + image retrieval

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

MODEL_ID = "DataScience-UIBK/Argus-Colqwen3.5-4b-v0"
DEVICE   = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE    = torch.bfloat16    # or torch.float32 for max precision

model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=DTYPE,
    attn_implementation="flash_attention_2",   # or None / "sdpa"
    device_map=DEVICE,
).eval()

processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    max_num_visual_tokens=2048,
)

queries = [
    "What is the company's revenue in 2019?",
    "How does the proposed model compare to baselines?",
]
documents = [
    Image.open("page_a.png").convert("RGB"),
    Image.open("page_b.png").convert("RGB"),
]

q_emb  = model.encode_queries(processor, queries)         # list of (Lq, 1024)
d_emb  = model.encode_images(processor, documents)         # list of (Ld, 1024)
scores = processor.score(q_emb, d_emb)                     # MaxSim, shape (len(q), len(d))
print(scores)
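
For intuition, MaxSim late interaction sums, over the query tokens, the best dot product each query token achieves against any document token. The pure-Python sketch below shows that rule on toy 2-d vectors. It is not the code behind `processor.score`, which works on batched tensors, and the function names are illustrative.

```python
def maxsim(query_emb, doc_emb):
    """Sum over query tokens of the max dot product with any document token."""
    return sum(
        max(sum(q * d for q, d in zip(q_tok, d_tok)) for d_tok in doc_emb)
        for q_tok in query_emb
    )

def score_matrix(queries, docs):
    """Return scores[i][j] = MaxSim(query i, document j)."""
    return [[maxsim(q, d) for d in docs] for q in queries]

# Toy embeddings: 2 query tokens; one 3-token page and one 1-token page.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
page_b = [[0.5, 0.5]]
print(score_matrix([query], [page_a, page_b]))  # [[2.0, 1.0]]
```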

Reproduce the leaderboard ViDoRe results with MTEB

import mteb

m  = mteb.get_model("DataScience-UIBK/Argus-Colqwen3.5-4b-v0")
v1 = mteb.get_benchmark("ViDoRe(v1)").tasks
v2 = mteb.get_benchmark("ViDoRe(v2)").tasks
mteb.MTEB(tasks=v1 + v2).run(m, encode_kwargs={"batch_size": 4})

A single H100 80 GB completes the full V1 + V2 run in roughly 4–6 hours.

Reproduce on the official ViDoRe-benchmark library

pip install vidore-benchmark
vidore-benchmark evaluate-retriever \
  --model-class colqwen2 \
  --model-name DataScience-UIBK/Argus-Colqwen3.5-4b-v0 \
  --collection-name vidore-v1

Training

| Setting | Value |
| --- | --- |
| Backbone | Qwen/Qwen3.5-VL-4B-Instruct (Apache-2.0) |
| Stage 1: dense baseline | trains the standard ColPali head; serves as the teacher |
| Stage 2: MoE balance warmup | gates frozen, no PEFT, short; the only goal is to prevent expert collapse |
| Stage 3: joint retrieval w/ distillation | PEFT on, gates trainable, KL distillation from stage-1 teacher (distillation_weight=0.5) |
| LoRA rank | 32 (folded into base for this release via merge_and_unload() in fp32) |
| Datasets | vidore/colpali_train_set + llamaindex/vdr-multilingual-train (subsets) |
| Hardware | 4 × NVIDIA H100 80 GB (zen4_0768_h100x4 partition, UIBK LEO5 cluster) |
| Optimiser | AdamW, lr = 5e-5 with linear warmup |
| Precision | bf16 forward / fp32 master + LoRA |
| Effective batch size | 64 |
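
One way to read the stage-3 objective is a retrieval loss plus 0.5 times a KL term that pulls the student's score distribution toward the stage-1 teacher's. The sketch below assumes an in-batch softmax cross-entropy as the retrieval loss and KL(teacher || student) over each query's row of scores. Both choices and the name `stage3_loss` are assumptions, since the card gives only the weight.

```python
import math

def log_softmax(row):
    peak = max(row)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_z for x in row]

def stage3_loss(student_scores, teacher_scores, distillation_weight=0.5):
    """scores[i][j] is the score of query i against page j; page i is the positive."""
    n = len(student_scores)
    contrastive, kl = 0.0, 0.0
    for i in range(n):
        s_log = log_softmax(student_scores[i])
        t_log = log_softmax(teacher_scores[i])
        contrastive += -s_log[i]                        # in-batch cross-entropy
        kl += sum(math.exp(t) * (t - s) for t, s in zip(t_log, s_log))
    return (contrastive + distillation_weight * kl) / n

student = [[5.0, 1.0], [0.0, 4.0]]
teacher = [[3.0, 2.0], [1.0, 3.0]]
print(stage3_loss(student, teacher))
```

When the student reproduces the teacher's scores exactly, the KL term vanishes and only the retrieval loss remains.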

The merge step that produced this release was run in float32 throughout (merge_and_unload() on the LoRA adapter, then sharded to safetensors). The companion bf16 release ran the same merge in bfloat16, which is ~0.1 pp lower on V1 and ~0.2 pp lower on V2; see the bf16 sibling card.
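
A toy illustration of why the merge precision can matter: bfloat16 keeps only 8 significant bits, so a small LoRA update added to a larger base weight can be rounded away entirely. The numbers below are made up for illustration and are not taken from the model; `to_bf16` emulates bfloat16 rounding with the standard library and assumes finite inputs.

```python
import struct

def to_fp32(x: float) -> float:
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x: float) -> float:
    """Round to the nearest bfloat16 value (ties to even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)        # round-to-nearest-even bias
    bits = ((bits + bias) >> 16) << 16        # keep the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

base_weight = 0.5      # made-up base weight
lora_delta = 1e-4      # made-up merged LoRA update

print("fp32 merge:", to_fp32(base_weight + lora_delta))
print("bf16 merge:", to_bf16(base_weight + lora_delta))
```

Near 0.5 the spacing between adjacent bfloat16 values is 2^-8 (about 0.0039), so the 1e-4 update survives the float32 merge but disappears in the bfloat16 one.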

Limitations

  • English-dominant; the multilingual training subset is small and we omit multilingual eval from this release.
  • 4 experts × top-2 routing adds ~5 % to total inference latency vs the dense backbone (the LLM dominates total cost).
  • ViDoRe v3 numbers are pending; will be added once the public reproducer run finishes.
  • Per-task numbers above use the bf16 sibling as a conservative lower bound until the fp32 reproducer run completes; they will be replaced with the fp32 numbers in a follow-up commit.

License

Apache 2.0, inherited from Qwen3.5-VL-4B-Instruct. You may use, modify, and redistribute this model commercially, with attribution.

Citation

@misc{argus2026,
  title  = {Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval},
  author = {DataScience-UIBK team},
  year   = {2026},
  url    = {https://huggingface.co/DataScience-UIBK/Argus-Colqwen3.5-4b-v0},
}

Contact

  • Org: DataScience-UIBK, University of Innsbruck
  • Issues: open one on this repo's Community tab.