SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks • Paper • 2603.24755 • Published Mar 25 • 30
RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning • Paper • 2603.09160 • Published Mar 10 • 17
R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training • Paper • 2505.00358 • Published May 1, 2025 • 26
Shrinking the Generation-Verification Gap with Weak Verifiers • Paper • 2506.18203 • Published Jun 22, 2025 • 2
Time To Impeach LLM-as-a-Judge: Programs are the Future of Evaluation • Paper • 2506.10403 • Published Jun 12, 2025 • 2
The ALCHEmist: Automated Labeling 500x CHEaper Than LLM Data Annotators • Paper • 2407.11004 • Published Jun 25, 2024
ScriptoriumWS: A Code Generation Assistant for Weak Supervision • Paper • 2502.12366 • Published Feb 17, 2025