ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark Paper • 2501.01290 • Published Jan 2, 2025 • 1
Nemotron-Terminal Collection We are releasing Nemotron-Terminal models and training datasets. • 5 items • Updated 3 days ago • 26
Endless Terminals: Scaling RL Environments for Terminal Agents Paper • 2601.16443 • Published Jan 23 • 18
TermiGen: High-Fidelity Environment and Robust Trajectory Synthesis for Terminal Agents Paper • 2602.07274 • Published 28 days ago • 206
Decoding as Optimisation on the Probability Simplex: From Top-K to Top-P (Nucleus) to Best-of-K Samplers Paper • 2602.18292 • Published 14 days ago • 10
MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use Paper • 2509.24002 • Published Sep 28, 2025 • 176
Article Unlocking Agentic RL Training for GPT-OSS: A Practical Retrospective Jan 27 • 63
Enterprise Agents and Benchmarks Collection Enterprise agent ecosystem featuring AssetOpsBench (industrial) and ITBench (SRE, FinOps, CISO), CUGA to accelerate AI Automation • 10 items • Updated 20 days ago • 14
Toward Efficient Agents: Memory, Tool learning, and Planning Paper • 2601.14192 • Published Jan 20 • 56
DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering Paper • 2507.11527 • Published Jul 15, 2025 • 35
Article The Agent Era Is Here: A Comprehensive Survey of Large Language Model Agents Apr 8, 2025 • 3
TRAJECT-Bench: A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use Paper • 2510.04550 • Published Oct 6, 2025 • 2