Researchers have introduced Claw-Eval-Live, a new agent evaluation framework that separates a refreshable signal layer, updated from real-world workflow demand, from reproducible, time-stamped release snapshots. Published to arXiv on April 30, 2026, the benchmark shows the leading model passing only 66.7% of tasks, with no model reaching 70%, suggesting that reliable workflow automation remains far from solved.
Benchmark Architecture Addresses Static Task Set Limitations
Many agent benchmarks freeze a curated task set at release time and grade mainly the final response, making it difficult to evaluate agents against evolving workflow demand or to verify that a task was actually executed. Claw-Eval-Live addresses this by constructing releases from public workflow-demand signals, currently the ClawHub Top-500 skills, and materializing each release as controlled tasks with fixed fixtures, services, workspaces, and graders. The current release comprises 105 tasks spanning controlled business services (HR, management, and multi-system workflows) plus local workspace repair.
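As a concrete picture of this two-layer design, the sketch below models a release snapshot in Python. The paper does not publish its internal schema, so every class and field name here is an illustrative assumption; only the conceptual split between a refreshable signal source and frozen, fully specified tasks comes from the source.

```python
from dataclasses import dataclass, field

# Hypothetical schema: all names are illustrative, not from the paper.
# What is grounded in the source: each task in a frozen release carries
# fixed fixtures, services, a workspace, and a grader, and the release
# records which refreshable demand signal it was built from.

@dataclass
class Task:
    task_id: str
    family: str                 # e.g. "hr", "management", "multi_system", "workspace_repair"
    fixtures: dict[str, str]    # input files/state seeded before the run
    services: list[str]         # controlled service endpoints available to the agent
    workspace_path: str         # pre-built local workspace for the task
    grader_id: str              # grading procedure bound to this task

@dataclass
class ReleaseSnapshot:
    timestamp: str              # release date; the snapshot never changes after this
    signal_source: str          # e.g. "clawhub-top-500", the refreshable layer
    tasks: list[Task] = field(default_factory=list)
```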
Execution Verification Replaces Response-Only Evaluation
Unlike benchmarks that accept agent responses at face value, Claw-Eval-Live records execution traces, audit logs, service state, and post-run workspace artifacts. Grading applies deterministic checks wherever this evidence is sufficient and reserves structured LLM judging for semantic dimensions, so passing a task requires actual execution in a controlled environment, verified by artifact inspection rather than by the final response alone.
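A minimal sketch of that grading policy, assuming hypothetical evidence fields and a stubbed judge call (neither interface is specified in the source): deterministic checks run first over the recorded evidence, and only semantic dimensions fall through to an LLM judge.

```python
# Sketch only: the evidence keys and llm_judge helper are assumptions,
# not the benchmark's actual interfaces.

def llm_judge(rubric: str, output: str) -> bool:
    """Hypothetical structured judge: returns a rubric-bound verdict."""
    raise NotImplementedError

def grade(evidence: dict) -> bool:
    # Deterministic layer: traces, audit logs, service state, and
    # post-run workspace artifacts settle everything they can.
    if not evidence["audit_log"]:                     # no recorded execution => fail
        return False
    if evidence["service_state"] != evidence["expected_state"]:
        return False
    for path, want in evidence["expected_artifacts"].items():
        if evidence["workspace_hashes"].get(path) != want:
            return False
    # Semantic layer: only dimensions the deterministic checks cannot
    # decide (e.g. quality of drafted text) go to the structured judge.
    return all(
        llm_judge(dim["rubric"], dim["output"])
        for dim in evidence.get("semantic_dimensions", [])
    )
```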
Leading Models Show Significant Performance Gaps Across Task Families
Evaluation of 13 frontier models under a shared public pass rule reveals persistent bottlenecks in HR, management, and multi-system business workflows, while local workspace repair tasks prove comparatively easier but remain unsaturated. The authors also find that leaderboard rank alone is insufficient: models with similar pass rates can diverge in overall completion, and task-level discrimination concentrates in a middle band of tasks rather than spreading evenly across difficulty levels.
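The divergence between pass rate and completion is easy to make concrete. In the toy example below, with scores invented for illustration rather than drawn from the paper, two models tie under a binary pass rule yet differ substantially in average completion:

```python
def pass_rate(scores: list[float], threshold: float = 1.0) -> float:
    """Fraction of tasks fully passed under a shared binary pass rule."""
    return sum(s >= threshold for s in scores) / len(scores)

def mean_completion(scores: list[float]) -> float:
    """Average partial completion across tasks (0.0 to 1.0)."""
    return sum(scores) / len(scores)

model_a = [1.0, 1.0, 0.9, 0.2, 0.1]   # passes 2/5, fails gracefully elsewhere
model_b = [1.0, 1.0, 0.0, 0.0, 0.0]   # same 2/5 pass rate, collapses otherwise

assert pass_rate(model_a) == pass_rate(model_b) == 0.4
print(mean_completion(model_a))  # 0.64
print(mean_completion(model_b))  # 0.40
```

Identical leaderboard rank, very different behavior off the pass boundary, which is the pattern the paper attributes to its middle band of discriminating tasks.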
Live Evaluation Methodology Enables Continuous Benchmark Updates
Beyond static benchmarking, the methodology supports continuous updates, with each new release constructed from fresh public workflow-demand signals. This grounds workflow-agent evaluation twice: in external demand that reflects what users actually need automated, and in verifiable agent action within controlled execution environments. The framework suggests that future agent evaluation should move beyond frozen task sets toward dynamic assessment aligned with evolving real-world automation requirements.
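Read as a process, the two-layer design implies a simple cadence, sketched below with placeholder functions; none of this is the authors' tooling. The signal layer is refreshed from public demand, and each release freezes what it finds into a dated, reproducible snapshot.

```python
import datetime

def refresh_signal_layer() -> list[str]:
    """Hypothetical: fetch the current public demand signal (e.g. ClawHub Top-500)."""
    raise NotImplementedError

def materialize_task(skill: str) -> dict:
    """Hypothetical: turn one demanded skill into a controlled task with
    fixed fixtures, services, a workspace, and a grader."""
    raise NotImplementedError

def cut_release() -> dict:
    """Freeze a dated, reproducible snapshot from the live signal layer."""
    skills = refresh_signal_layer()
    return {
        "timestamp": datetime.date.today().isoformat(),
        "tasks": [materialize_task(s) for s in skills],
    }
```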
Key Takeaways
- Claw-Eval-Live separates refreshable workflow-demand signals from time-stamped release snapshots, addressing limitations of static benchmarks
- The benchmark evaluates 13 frontier models across 105 tasks, with the leading model passing only 66.7% and no model reaching 70%
- Grading verifies actual execution through traces, audit logs, service state, and workspace artifacts rather than evaluating responses alone
- Persistent bottlenecks appear in HR, management, and multi-system business workflows, while local workspace repair proves comparatively easier
- The methodology enables continuous benchmark updates constructed from public workflow-demand signals rather than frozen task sets