Researchers have developed AgentIR, a reasoning-aware retrieval system designed specifically for AI research agents that achieves 68% accuracy on the BrowseComp-Plus benchmark—an 18 percentage point improvement over conventional embedding models. Published March 4, 2026 on arXiv, the system jointly embeds an agent's reasoning trace alongside its query, exploiting the rich contextual information that agents naturally generate but existing retrievers ignore.
Paradigm Shift From Human-Centric to Agent-Centric Retrieval
Deep research agents differ fundamentally from human users in how they search. While humans issue and refine queries without documenting their thought processes, agents generate explicit natural language reasoning before each search call. This reasoning reveals search intent, context from previous searches, intermediate conclusions, and specific information gaps the agent is trying to fill—signals that conventional retrieval systems completely discard.
AgentIR introduces reasoning-aware retrieval as a new paradigm that treats this agent-generated reasoning as a first-class input. The system jointly embeds both the query and the full reasoning trace, allowing the retriever to understand not just what the agent is searching for, but why it needs that information and how it fits into the broader research task.
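The input construction can be pictured as a simple concatenation of trace and query before encoding. The separator markers and helper function below are illustrative assumptions, not the actual AgentIR implementation:

```python
# Hypothetical sketch of a reasoning-aware retrieval input: the agent's
# reasoning trace is prepended to the query so one encoder embeds both
# jointly. Marker tokens ([REASONING], [QUERY]) are assumptions for
# illustration only.

def build_retrieval_input(reasoning_trace: str, query: str) -> str:
    """Concatenate the reasoning trace with the query into a single
    string for joint embedding."""
    return f"[REASONING] {reasoning_trace.strip()} [QUERY] {query.strip()}"

# The trace carries intent and context the bare query lacks:
# prior findings and the specific information gap being filled.
trace = ("The 2019 filing names the subsidiary but not its founding year; "
         "I need the founding year to resolve the timeline.")
query = "founding year of the subsidiary named in the 2019 filing"

print(build_retrieval_input(trace, query))
```

A conventional retriever would embed only `query`; here the encoder also sees why the agent is asking, which is the signal the paper argues existing systems discard.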
Technical Architecture and Performance
The research team—Zijian Chen, Xueguang Ma, Shengyao Zhuang, Jimmy Lin, Akari Asai, and Victor Zhong—developed two core components:
- Reasoning-Aware Retrieval: A retrieval architecture that jointly processes agent reasoning traces and queries
- DR-Synth: A data synthesis method that generates training data for deep research retrievers from standard QA datasets
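The second component can be sketched as mapping a standard QA pair to a reasoning-conditioned training example. The field names and the template-based reasoning string below are illustrative assumptions; the paper's actual DR-Synth pipeline (which may generate richer traces) is not reproduced here:

```python
# Hypothetical sketch of DR-Synth-style synthesis: wrap a QA pair in an
# agent-style reasoning trace so a retriever can be trained on
# (reasoning + query, positive document) pairs. Field names and the
# reasoning template are assumptions for illustration.

def synthesize_example(question: str, answer: str, evidence_doc: str) -> dict:
    """Turn one QA-dataset item into a training example for a
    reasoning-aware retriever."""
    reasoning = (f"To answer the research question '{question}', "
                 f"I still need a source that states the answer directly.")
    return {
        "query": question,
        "reasoning_trace": reasoning,
        "positive_doc": evidence_doc,  # document containing the answer
        "answer": answer,
    }

example = synthesize_example(
    question="Which year was the transistor invented?",
    answer="1947",
    evidence_doc="The point-contact transistor was invented at Bell Labs in 1947.",
)
print(example["reasoning_trace"])
```

The point of the sketch is the shape of the training signal: the retriever learns to match reasoning-conditioned queries to evidence documents, rather than bare questions to documents.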
Benchmark results demonstrate substantial gains:
- AgentIR-4B with Tongyi-DeepResearch agent: 68% accuracy on BrowseComp-Plus
- Conventional embedding models (twice the size): 50% accuracy
- BM25 baseline: 37% accuracy
- Net improvement: 31 percentage points over keyword-based retrieval
Both components proved independently effective, with their combination yielding the trained AgentIR-4B embedding model. The research team has released code and data at https://texttron.github.io/AgentIR/.
Implications for Agent Infrastructure
As AI agents become primary consumers of retrieval systems rather than humans, this research signals a broader shift toward agent-native infrastructure. The substantial accuracy improvements demonstrate that systems purpose-built for agent workflows can dramatically outperform adapted human-centric tools. The approach generalizes beyond research agents to any AI system that generates reasoning traces before information-seeking actions, including customer service agents, coding assistants, and analytical systems.
Key Takeaways
- AgentIR achieves 68% accuracy on BrowseComp-Plus, outperforming conventional embeddings by 18 percentage points and BM25 by 31 points
- The system jointly embeds agent reasoning traces with queries, exploiting contextual signals that existing retrievers ignore
- DR-Synth synthesizes training data for deep research retrievers from standard QA datasets
- Research demonstrates a paradigm shift toward building retrieval systems specifically for AI agents rather than humans
- Code and data are publicly available at https://texttron.github.io/AgentIR/