FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents
Abstract
Deep research is emerging as a representative long-horizon task for large language model (LLM) agents. However, long trajectories in deep research often exceed model context limits, compressing token budgets for both evidence collection and report writing, and preventing effective test-time scaling. We introduce FS-Researcher, a file-system-based, dual-agent framework that scales deep research beyond the context window via a persistent workspace. Specifically, a Context Builder agent acts as a librarian that browses the internet, writes structured notes, and archives raw sources into a hierarchical knowledge base that can grow far beyond the context length. A Report Writer agent then composes the final report section by section, treating the knowledge base as the source of facts. In this framework, the file system serves as a durable external memory and a shared coordination medium across agents and sessions, enabling iterative refinement beyond the context window. Experiments on two open-ended benchmarks (DeepResearch Bench and DeepConsult) show that FS-Researcher achieves state-of-the-art report quality across different backbone models. Further analyses demonstrate a positive correlation between final report quality and the computation allocated to the Context Builder, validating effective test-time scaling under the file-system paradigm. The code and data are anonymously open-sourced at https://github.com/Ignoramus0817/FS-Researcher.
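To make the dual-agent pattern concrete, here is a minimal sketch of the file-system-as-external-memory idea described in the abstract. All names here (`build_context`, `write_report`, the `knowledge_base/notes` layout) are illustrative assumptions, not the paper's actual API; agent internals such as web browsing and LLM calls are stubbed out, since the real implementation lives in the linked repository.

```python
from pathlib import Path

def build_context(workspace: Path, findings: dict[str, str]) -> None:
    """Context Builder (sketch): archive structured notes into a
    hierarchical on-disk knowledge base, which can grow far beyond
    any single context window."""
    notes_dir = workspace / "knowledge_base" / "notes"
    notes_dir.mkdir(parents=True, exist_ok=True)
    for topic, text in findings.items():
        # One note file per topic; in the real system these would be
        # distilled from browsed web sources rather than passed in.
        (notes_dir / f"{topic}.md").write_text(text, encoding="utf-8")

def write_report(workspace: Path, outline: list[str]) -> str:
    """Report Writer (sketch): compose the report section by section,
    loading only the note relevant to each section instead of holding
    the whole corpus in context."""
    notes_dir = workspace / "knowledge_base" / "notes"
    sections = []
    for topic in outline:
        note = notes_dir / f"{topic}.md"
        body = note.read_text(encoding="utf-8") if note.exists() else "(no notes)"
        sections.append(f"## {topic}\n{body}")
    return "\n\n".join(sections)
```

Because both agents only touch the shared workspace, they can run in separate sessions: the Context Builder can keep appending notes across many tool-calling turns, and the Report Writer later reads them back one section at a time, which is what lets compute scale without the trajectory fitting in one context window.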
Community
A deep research agent that uses the file system as its scaling substrate, providing external, persistent context. Although achieving maximal performance on downstream tasks still requires substantial task-specific design at this stage, we believe the file system has the potential to become a standard component of LLM agents.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation (2026)
- IDRBench: Interactive Deep Research Benchmark (2026)
- NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents (2025)
- DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing (2026)
- DR-Arena: an Automated Evaluation Framework for Deep Research Agents (2026)
- InfiAgent: An Infinite-Horizon Framework for General-Purpose Autonomous Agents (2026)
- LongDA: Benchmarking LLM Agents for Long-Document Data Analysis (2026)