Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models
Abstract
Generalizable Predictive Prompt Selection (GPS) uses Bayesian inference with a lightweight generative model to efficiently select informative prompts during RL post-training of large reasoning models, improving training efficiency and final performance.
Reinforcement learning enhances the reasoning capabilities of large language models, but often incurs high computational costs due to rollout-intensive optimization. Online prompt selection offers a promising solution by prioritizing informative prompts to improve training efficiency. However, current methods either depend on costly, exact evaluations or build prompt-specific predictive models that do not generalize across prompts. This study introduces Generalizable Predictive Prompt Selection (GPS), which performs Bayesian inference over prompt difficulty using a lightweight generative model trained on the shared optimization history. Informative prompt batches are selected with a batch acquisition rule that prioritizes intermediate-difficulty prompts while maintaining history-anchored diversity. The small predictive model also generalizes at test time, enabling efficient allocation of inference compute. Experiments across varied reasoning benchmarks show that GPS substantially improves training efficiency, final performance, and test-time efficiency over strong baselines.
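The abstract describes a batch acquisition rule that combines intermediate-difficulty prioritization with history-anchored diversity. The sketch below is a minimal illustration of how such a rule could be scored; it assumes a hypothetical per-prompt predicted success probability (from the lightweight predictive model) and prompt embeddings for measuring diversity, and is not the authors' actual implementation.

```python
import numpy as np

def select_prompt_batch(candidate_embs, history_embs, pred_success_prob,
                        batch_size=32, lam=0.5):
    """Illustrative batch acquisition: prefer prompts of intermediate
    predicted difficulty (success probability near 0.5) that are dissimilar
    to prompts already seen in the optimization history."""
    # Intermediate-difficulty score: peaks at 1.0 when predicted success
    # probability is 0.5, falls to 0.0 at probabilities 0.0 or 1.0.
    difficulty_score = 1.0 - 2.0 * np.abs(pred_success_prob - 0.5)

    # History-anchored diversity: cosine distance from each candidate to its
    # nearest neighbor among previously selected (history) prompts.
    cand = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    hist = history_embs / np.linalg.norm(history_embs, axis=1, keepdims=True)
    diversity_score = 1.0 - (cand @ hist.T).max(axis=1)

    # Combine the two terms and return indices of the top-scoring prompts.
    acquisition = difficulty_score + lam * diversity_score
    return np.argsort(-acquisition)[:batch_size]
```

The weighting `lam` and the choice of cosine distance are illustrative assumptions; the paper's acquisition principle may weight or combine these terms differently.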
Community
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- Anti-Length Shift: Dynamic Outlier Truncation for Training Efficient Reasoning Models (2026)
- Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification (2026)
- Miner: Mining Intrinsic Mastery for Data-Efficient RL in Large Reasoning Models (2026)
- AWPO: Enhancing Tool-Use of Large Language Models through Explicit Integration of Reasoning Rewards (2025)
- MulFeRL: Enhancing Reinforcement Learning with Verbal Feedback in a Multi-turn Loop (2026)
- Distribution-Aware Reward Estimation for Test-Time Reinforcement Learning (2026)
- From Absolute to Relative: Rethinking Reward Shaping in Group-Based Reinforcement Learning (2026)