Researchers from multiple institutions have introduced ClawBench, a new evaluation framework that exposes a stark gap between AI agents' performance on controlled benchmarks and their performance on real-world tasks. The benchmark tests whether AI agents can complete 153 realistic everyday online tasks across 144 live platforms, with the best-performing model achieving only a 33.3% success rate.
ClawBench Tests AI Agents on Live Production Websites
Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity and dynamic nature of real-world web interaction. The framework uses a lightweight interception layer that captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Tasks span 15 categories and include completing purchases, booking appointments, submitting job applications, obtaining relevant information from user-provided documents, and filling detailed forms correctly.
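The core idea of the interception layer can be sketched in a few lines: let all ordinary traffic (page loads, searches, form prefills) through, but capture and short-circuit the one state-changing request that would commit the action. The sketch below is illustrative only; the endpoint patterns, function names, and matching rules are assumptions, not details from the paper.

```python
import re
from dataclasses import dataclass

@dataclass
class Request:
    method: str
    url: str

# Hypothetical "final submission" endpoints; the benchmark's actual
# matching rules are not public in this summary.
SUBMIT_PATTERNS = [
    re.compile(r"/checkout/confirm"),
    re.compile(r"/applications/submit"),
    re.compile(r"/appointments/book"),
]

def should_block(req: Request) -> bool:
    """Block only state-changing requests aimed at a submission endpoint."""
    if req.method not in {"POST", "PUT"}:
        return False
    return any(p.search(req.url) for p in SUBMIT_PATTERNS)

def intercept(req: Request) -> str:
    # Capture the blocked request (its payload can then be graded)
    # and short-circuit it so no real-world side effect occurs.
    if should_block(req):
        return "captured-and-blocked"
    return "forwarded"
```

In practice such a layer would sit in a browser-automation hook or proxy; the key design point is that everything except the final commit runs against the real site, preserving its dynamic behavior.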
Frontier Models Show Dramatic Performance Drop on Real-World Tasks
The researchers evaluated seven frontier models, with Claude Sonnet 4.6 achieving the highest success rate at only 33.3%. This represents a massive decline from sandbox benchmark performance, where models typically score around 70%. The results highlight that both proprietary and open-source models can complete only a small portion of real-world tasks, with some models achieving success rates as low as 6.5%.
Real-World Complexity Exposes Limitations of Current Benchmarks
The performance gap suggests that sandbox benchmarks have overstated AI agents' practical capabilities. Real-world tasks require dealing with messy user interfaces, edge cases, and dynamic content that static benchmarks cannot replicate. The researchers argue that progress on ClawBench brings the field closer to AI agents that can function as reliable general-purpose assistants, rather than systems that excel only in controlled environments.
Key Takeaways
- ClawBench evaluates AI agents on 153 everyday tasks across 144 live production websites, testing real-world capabilities
- Claude Sonnet 4.6 achieved the highest success rate at only 33.3%, with some models scoring as low as 6.5%
- Performance represents a dramatic drop from sandbox benchmarks where models typically achieve ~70% success rates
- The framework uses a lightweight interception layer to safely test agents on live websites without causing real-world side effects
- Results demonstrate that current AI agents struggle with messy UIs, edge cases, and dynamic content present in real-world applications