Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression
Abstract
A distribution-aware reinforcement learning framework improves multimodal large language models' numerical regression performance on long-tailed distributions through batch-level comparison-based supervision.
Multimodal large language models (MLLMs) struggle with numerical regression under long-tailed target distributions. Token-level supervised fine-tuning (SFT) and point-wise regression rewards bias learning toward high-density regions, leading to regression-to-the-mean behavior and poor tail performance. We identify the lack of cross-sample relational supervision as a key limitation of existing MLLM training paradigms. To address it, we propose a distribution-aware reinforcement learning framework based on Group Relative Policy Optimization, which introduces batch-level comparison-based supervision via the Concordance Correlation Coefficient-based reward to align predicted and ground-truth distributions in terms of correlation, scale, and mean. The framework is plug-and-play, requiring no architectural modification. Experiments on a unified suite of long-tailed regression benchmarks show consistent improvements over SFT and existing MLLM regression methods, with particularly strong gains in medium- and few-shot regimes.
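The Concordance Correlation Coefficient (CCC) reward at the heart of the framework can be sketched as follows. This is a minimal illustration of how a batch of predictions could be scored against ground truth; the function name and reward wiring are assumptions for illustration, not the authors' exact implementation.

```python
def ccc_reward(preds, targets):
    """Concordance Correlation Coefficient between a batch of predicted
    and ground-truth values. Rewards agreement in correlation, scale
    (variance), and mean simultaneously, unlike a point-wise error."""
    n = len(preds)
    mean_p = sum(preds) / n
    mean_t = sum(targets) / n
    var_p = sum((p - mean_p) ** 2 for p in preds) / n
    var_t = sum((t - mean_t) ** 2 for t in targets) / n
    cov = sum((p - mean_p) * (t - mean_t)
              for p, t in zip(preds, targets)) / n
    # CCC = 2*cov / (var_p + var_t + (mean_p - mean_t)^2), in [-1, 1]
    return 2 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)
```

Because the denominator penalizes both a variance mismatch and a mean shift, a policy that collapses all predictions toward the head of the distribution (regression-to-the-mean) receives a low reward even if its average error is small, which is what makes the reward batch-level rather than per-sample.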
Community
Identifies the limitations of supervised fine-tuning for numerical regression, and introduces GRPO with batch-aware supervision for deep imbalanced regression, together with a newly constructed DIR benchmark.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- TAMTRL: Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning in Long-Context Compression (2026)
- REAL: Regression-Aware Reinforcement Learning for LLM-as-a-Judge (2026)
- S-GRPO: Unified Post-Training for Large Vision-Language Models (2026)
- Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL (2026)
- GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification (2026)
- Reinforcement Learning-based Knowledge Distillation with LLM-as-a-Judge (2026)
- Structured Role-Aware Policy Optimization for Multimodal Reasoning (2026)
Get this paper in your agent: `hf papers read 2605.01402`