SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models
Abstract
SCALE is a novel inference strategy for Vision-Language-Action models that jointly modulates visual perception and action based on self-uncertainty, improving robustness without additional training or multiple forward passes.
Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic control, and test-time scaling (TTS) has gained attention as a way to enhance robustness beyond training. However, existing TTS methods for VLAs require additional training, verifiers, and multiple forward passes, making them impractical for deployment. Moreover, they intervene only at action decoding while keeping visual representations fixed, which is insufficient under perceptual ambiguity, where reconsidering how to perceive is as important as deciding what to do. To address these limitations, we propose SCALE, a simple inference strategy that jointly modulates visual perception and action based on 'self-uncertainty', inspired by uncertainty-driven exploration in Active Inference theory, and that requires no additional training, no verifier, and only a single forward pass. SCALE broadens exploration in both perception and action under high uncertainty and focuses on exploitation when confident, enabling adaptive execution across varying conditions. Experiments on simulated and real-world benchmarks demonstrate that SCALE improves state-of-the-art VLAs and outperforms existing TTS methods while maintaining single-pass efficiency.
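The abstract does not give implementation details, but the core idea can be illustrated with a minimal sketch. The snippet below assumes that self-uncertainty is measured as the normalized entropy of the action-token logits from a single forward pass, and that "broadening exploration" is realized by a shared softmax temperature applied to both the visual-attention scores and the action distribution. All names (`self_uncertainty`, `scale_step`, `tau_min`, `tau_max`) are illustrative, not the authors' API.

```python
# Hypothetical sketch of the SCALE idea from the abstract (not the authors'
# implementation). Assumption: uncertainty = normalized entropy of the
# action-token logits; exploration is broadened by raising a temperature
# shared between visual attention and action sampling.
import torch
import torch.nn.functional as F

def self_uncertainty(action_logits: torch.Tensor) -> torch.Tensor:
    """Normalized entropy of the action-token distribution, in [0, 1].
    action_logits: (seq_len, vocab_size)."""
    probs = F.softmax(action_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
    return entropy / torch.log(torch.tensor(float(action_logits.shape[-1])))

def scale_step(action_logits, visual_attn_scores, tau_min=0.5, tau_max=1.5):
    """Single-pass adaptation: map uncertainty to a temperature that flattens
    (explores) or sharpens (exploits) both perception and action."""
    u = self_uncertainty(action_logits).mean()            # scalar uncertainty
    tau = tau_min + (tau_max - tau_min) * u               # high u -> high tau
    attn = F.softmax(visual_attn_scores / tau, dim=-1)    # modulated attention
    action_probs = F.softmax(action_logits / tau, dim=-1) # modulated decoding
    action = torch.multinomial(action_probs[-1], 1)       # sample next action token
    return action, attn, tau
```

Because the uncertainty signal is read from the same logits the model already produces, this sketch needs no extra forward pass or verifier: low entropy leaves attention and action sampling sharp (exploitation), while high entropy flattens both (exploration).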
Community
We tackle test-time robustness of VLA models without additional training or multiple forward passes by proposing SCALE, which jointly modulates visual attention and action decoding based on self-uncertainty.
The following papers were recommended by the Semantic Scholar API
- AC^2-VLA: Action-Context-Aware Adaptive Computation in Vision-Language-Action Models for Efficient Robotic Manipulation (2026)
- VISTA: Enhancing Visual Conditioning via Track-Following Preference Optimization in Vision-Language-Action Models (2026)
- On-the-Fly VLA Adaptation via Test-Time Reinforcement Learning (2026)
- LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries (2026)
- Efficient Long-Horizon Vision-Language-Action Models via Static-Dynamic Disentanglement (2026)
- DTP: A Simple yet Effective Distracting Token Pruning Framework for Vision-Language Action Models (2026)
- ReViP: Reducing False Completion in Vision-Language-Action Models with Vision-Proprioception Rebalance (2026)