AARRI-Bench: Best AI Research Agents Only Achieve 68.3% on Real Researcher Tasks

A new benchmark suggests that even the most advanced AI agents struggle with the nuanced decision-making required in real research work. AARRI-Bench, introduced by researchers led by Jiayu Wang, shows that the best-performing configuration, Mini-SWE-Agent with Claude Opus 4.7, achieves only a 68.3% success rate on tasks requiring researcher-like judgment. The study concludes that frontier agents "remain unable to fully replace human researchers," citing significant limitations in field sensitivity, research ethics, and scientific judgment.

AARRI-Bench Tests Granular Research Behaviors

Unlike existing benchmarks that focus on macro-level execution abilities, AARRI-Bench evaluates whether agents demonstrate "the professionalism, thoroughness, and nuanced reasoning that characterize human researchers" in granular research scenarios. The benchmark emphasizes detailed, domain-specific understanding and judgment calls that real researchers make routinely, rather than broad task completion.

Key evaluation areas include:

Subtle detail recognition that human researchers find obvious
Field-specific sensitivity and domain knowledge application
Research ethics and professional standards
Nuanced scientific judgment in ambiguous situations

Agents Overlook Critical Details Despite Advanced Scaffolding

The research team found that agents "frequently overlook subtle yet critical details that are obvious to real human researchers," even when using sophisticated agent frameworks like SWE-agent. This points to a limitation in how current AI systems approach research tasks: the problem is not merely execution capability but a lack of the deep contextual understanding that characterizes professional research work.

The researchers conclude that "developing researcher-like AI requires further exploration of research behavior, rather than merely complex scaffolding." This indicates that better agent frameworks alone won't solve the problem; models need to learn actual research behaviors through improved training approaches.

First in Planned Benchmark Series

AARRI-Bench, which stands for "Act As a Real Research Intern," is the first in a planned AARR (Act As a Real Researcher) benchmark series. The researchers envision expanding the evaluation framework across different research career stages and specialties, creating a comprehensive assessment system for AI research capabilities. The benchmark data and evaluation framework have been released at https://github.com/AARR-bench/AARRI-bench.

Key Takeaways

The best AI research agent configuration (Mini-SWE-Agent with Claude Opus 4.7) achieves only 68.3% success on AARRI-Bench tasks
AI agents frequently miss subtle but critical details that human researchers readily identify
Frontier agents show significant limitations in field sensitivity, research ethics, and nuanced scientific judgment
Better agent scaffolding alone won't solve the problem; models need to learn actual research behaviors
AARRI-Bench is the first in a planned series evaluating AI across different research career stages
