A new benchmark released on April 15, 2026, exposes a critical weakness in frontier AI models: extended reasoning over long horizons. The LongCoT (Long-Horizon Chain-of-Thought) benchmark shows that even the most advanced models achieve less than 10% accuracy on problems requiring tens to hundreds of thousands of reasoning tokens, with GPT 5.2 scoring 9.8% and Gemini 3 Pro reaching just 6.1%.
LongCoT Tests Extended Reasoning Across 2,500 Expert-Designed Problems
The benchmark comprises 2,500 problems spanning chemistry, mathematics, computer science, chess, and logic. Unlike traditional benchmarks, LongCoT targets long-horizon planning: models must work through interdependent reasoning steps while staying consistent across extended chains. Each problem pairs a short input with a verifiable answer, but solving it requires traversing a complex graph of reasoning steps.
What makes LongCoT particularly revealing is its design: each individual reasoning step is within the capability of frontier models. This means failures don't reflect an inability to solve individual steps, but rather a fundamental limitation in long-horizon planning and maintaining coherence across extended reasoning chains.
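The gap between per-step and end-to-end competence follows directly from error compounding: when steps are interdependent, one early mistake can invalidate everything downstream. A minimal back-of-the-envelope sketch (the per-step accuracy and chain lengths are illustrative numbers, not figures from the benchmark):

```python
def chain_success_probability(per_step_accuracy: float, num_steps: int) -> float:
    """Probability of completing a chain of dependent steps without error,
    under the simplifying assumptions that step errors are independent
    and any single mistake is fatal to the final answer."""
    return per_step_accuracy ** num_steps

# With 99% per-step accuracy, a 100-step chain still succeeds roughly a
# third of the time, but a 10,000-step chain essentially never does.
for n in (100, 1_000, 10_000):
    print(f"{n:>6} steps: {chain_success_probability(0.99, n):.2e}")
```

This toy model overstates the brittleness of real models, which can catch and repair some errors, but it captures why near-perfect local competence is not enough at the token scales LongCoT demands.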
Frontier Models Struggle Despite Individual Step Competence
The benchmark results highlight a substantial gap in current AI capabilities:
- GPT 5.2 achieved 9.8% accuracy
- Gemini 3 Pro reached 6.1% accuracy
- Problems require tens to hundreds of thousands of reasoning tokens
- Each local reasoning step is individually tractable for these models
- Failures isolate long-horizon planning limitations rather than single-step difficulty
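At the reported scale, those percentages correspond to small absolute counts of solved problems. The per-model counts below are my own arithmetic from the published accuracy figures, not numbers reported by the benchmark:

```python
TOTAL_PROBLEMS = 2_500  # size of the LongCoT problem set

# Reported accuracies; solved-problem counts are derived and rounded.
for model, accuracy in [("GPT 5.2", 0.098), ("Gemini 3 Pro", 0.061)]:
    solved = round(accuracy * TOTAL_PROBLEMS)
    print(f"{model}: ~{solved} of {TOTAL_PROBLEMS} problems solved")
```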
Evaluation Uses Universal-PRM-7B Process Reward Model
Researchers employed Universal-PRM-7B, a process reward model that assesses LongCoT reasoning traces and supports input lengths of up to 32,768 tokens. The evaluation methodology draws on established reasoning benchmarks, including MATH-500, AIME 2024, TheoremQA, and MMLU-Pro-1k, to ensure comprehensive coverage of reasoning domains.
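Since LongCoT traces can run far past a 32,768-token window, one plausible way to apply such a process reward model is to score a trace step by step inside a sliding context window. The sketch below is an assumption-laden illustration: it posits a generic `score_step(context, step) -> float` interface and a crude whitespace token estimate, neither of which is the actual Universal-PRM-7B API.

```python
from typing import Callable, List

def score_trace(
    steps: List[str],
    score_step: Callable[[str, str], float],  # hypothetical PRM interface
    max_context_tokens: int = 32_768,
    tokens: Callable[[str], int] = lambda s: len(s.split()),  # crude estimate
) -> List[float]:
    """Score each reasoning step with a process reward model, keeping the
    context fed to the model within its input-length limit by evicting
    the oldest steps whenever the window would overflow."""
    window: List[str] = []
    scores: List[float] = []
    for step in steps:
        while window and sum(map(tokens, window)) + tokens(step) > max_context_tokens:
            window.pop(0)  # drop the oldest step to stay under the limit
        scores.append(score_step("\n".join(window), step))
        window.append(step)
    return scores
```

With a real PRM, `score_step` would wrap a model call; any scoring callable works for testing. Sliding-window truncation is only one design choice here; summarizing evicted context is another.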
Why Long-Horizon Reasoning Matters for AI Deployment
As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. Long chain-of-thought capability enables more free-form, exploratory reasoning structures, allowing models to try alternative paths, backtrack, and correct errors. The current sub-10% accuracy shows that even frontier models struggle with extended planning, marking a key area for improvement as AI systems take on more sophisticated real-world tasks.
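The backtrack-and-correct behavior described above can be pictured as depth-first search over candidate reasoning steps: commit to a step, explore its consequences, and retreat to an earlier choice when a branch dead-ends. A toy sketch, with an invented problem structure purely for illustration:

```python
from typing import Callable, Dict, List, Optional

def solve(
    state: str,
    candidates: Dict[str, List[str]],  # maps a state to candidate next steps
    is_goal: Callable[[str], bool],
    path: Optional[List[str]] = None,
) -> Optional[List[str]]:
    """Depth-first search with backtracking: try each candidate next step,
    recurse, and abandon the choice (backtrack) when a branch dead-ends."""
    path = path or [state]
    if is_goal(state):
        return path
    for nxt in candidates.get(state, []):
        if nxt in path:
            continue  # avoid revisiting a state already on this path
        result = solve(nxt, candidates, is_goal, path + [nxt])
        if result is not None:
            return result
    return None  # dead end; the caller falls through to its next candidate

# A tiny example: the first candidate is a dead end, so the search
# backtracks and succeeds via the "lemma" branch.
graph = {"start": ["dead_end", "lemma"], "lemma": ["answer"]}
print(solve("start", graph, lambda s: s == "answer"))
```

A language model doing long chain-of-thought performs something analogous implicitly in text, which is exactly what makes coherence over very long traces hard to maintain.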
The research community is actively investigating reinforcement learning methods, structural properties of effective reasoning chains, and training approaches to develop these capabilities. LongCoT provides a scalable benchmark for tracking progress in this critical area.
Key Takeaways
- LongCoT benchmark reveals frontier models score under 10% on long-horizon reasoning tasks, with GPT 5.2 at 9.8% and Gemini 3 Pro at 6.1%
- The benchmark comprises 2,500 expert-designed problems requiring tens to hundreds of thousands of reasoning tokens across chemistry, mathematics, computer science, chess, and logic
- Individual reasoning steps are within model capabilities, isolating the challenge as long-horizon planning rather than single-step difficulty
- Long-horizon reasoning is critical for complex autonomous AI deployments, enabling backtracking and error correction
- The benchmark provides a scalable measure to track progress in extended reasoning capabilities as the research community develops new training approaches