A new benchmark released on April 15, 2026, exposes a critical weakness in frontier AI models: extended reasoning over long horizons. The LongCoT (Long-Horizon Chain-of-Thought) benchmark shows that even the most advanced models achieve less than 10% accuracy on problems requiring tens to hundreds of thousands of reasoning tokens, with GPT 5.2 scoring 9.8% and Gemini 3 Pro reaching just 6.1%.
LongCoT Tests Extended Reasoning Across 2,500 Expert-Designed Problems
The benchmark comprises 2,500 problems spanning chemistry, mathematics, computer science, chess, and logic. Unlike traditional benchmarks, LongCoT targets long-horizon planning: models must work through interdependent reasoning steps while staying consistent across extended chains. Each problem pairs a short input with a verifiable answer, but solving it requires traversing a complex graph of reasoning steps.
What makes LongCoT particularly revealing is its design: each individual reasoning step is within the capability of frontier models. This means failures don't reflect an inability to solve individual steps, but rather a fundamental limitation in long-horizon planning and maintaining coherence across extended reasoning chains.
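The gap between per-step and end-to-end competence follows directly from error compounding: when steps are interdependent, one early mistake can invalidate everything downstream. A minimal back-of-the-envelope sketch (the per-step accuracy and chain lengths are illustrative numbers, not figures from the benchmark):

```python
def chain_success_probability(per_step_accuracy: float, num_steps: int) -> float:
    """Probability of completing a chain of dependent steps without error,
    under the simplifying assumptions that step errors are independent
    and any single mistake is fatal to the final answer."""
    return per_step_accuracy ** num_steps

# With 99% per-step accuracy, a 100-step chain still succeeds roughly a
# third of the time, but a 10,000-step chain essentially never does.
for n in (100, 1_000, 10_000):
    print(f"{n:>6} steps: {chain_success_probability(0.99, n):.2e}")
```

This toy model overstates the brittleness of real models, which can catch and repair some errors, but it captures why near-perfect local competence is not enough at the token scales LongCoT demands.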
Frontier Models Struggle Despite Individual Step Competence
The benchmark results highlight a substantial gap in current AI capabilities:
- GPT 5.2 achieved 9.8% accuracy
- Gemini 3 Pro reached 6.1% accuracy
- Problems require tens to hundreds of thousands of reasoning tokens
- Each local reasoning step is individually tractable for these models
- Failures isolate long-horizon planning limitations rather than single-step difficulty
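At the reported scale, those percentages correspond to small absolute counts of solved problems. The per-model counts below are my own arithmetic from the published accuracy figures, not numbers reported by the benchmark:

```python
TOTAL_PROBLEMS = 2_500  # size of the LongCoT problem set

# Reported accuracies; solved-problem counts are derived and rounded.
for model, accuracy in [("GPT 5.2", 0.098), ("Gemini 3 Pro", 0.061)]:
    solved = round(accuracy * TOTAL_PROBLEMS)
    print(f"{model}: ~{solved} of {TOTAL_PROBLEMS} problems solved")
```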
Evaluation Uses Universal-PRM-7B Process Reward Model
Researchers employed Universal-PRM-7B, a process reward model that assesses LongCoT reasoning traces and supports input lengths of up to 32,768 tokens. The evaluation methodology draws on established reasoning benchmarks, including MATH-500, AIME 2024, TheoremQA, and MMLU-Pro-1k, to ensure comprehensive coverage of reasoning domains.
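Since LongCoT traces can run far past a 32,768-token window, one plausible way to apply such a process reward model is to score a trace step by step inside a sliding context window. The sketch below is an assumption-laden illustration: it posits a generic `score_step(context, step) -> float` interface and a crude whitespace token estimate, neither of which is the actual Universal-PRM-7B API.

```python
from typing import Callable, List

def score_trace(
    steps: List[str],
    score_step: Callable[[str, str], float],  # hypothetical PRM interface
    max_context_tokens: int = 32_768,
    tokens: Callable[[str], int] = lambda s: len(s.split()),  # crude estimate
) -> List[float]:
    """Score each reasoning step with a process reward model, keeping the
    context fed to the model within its input-length limit by evicting
    the oldest steps whenever the window would overflow."""
    window: List[str] = []
    scores: List[float] = []
    for step in steps:
        while window and sum(map(tokens, window)) + tokens(step) > max_context_tokens:
            window.pop(0)  # drop the oldest step to stay under the limit
        scores.append(score_step("\n".join(window), step))
        window.append(step)
    return scores
```

With a real PRM, `score_step` would wrap a model call; any scoring callable works for testing. Sliding-window truncation is only one design choice here; summarizing evicted context is another.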
Why Long-Horizon Reasoning Matters for AI Deployment
As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. Long chain-of-thought capability enables more free-form, exploratory reasoning structures, allowing models to try alternative paths, backtrack, and correct errors. The current sub-10% accuracy shows that even frontier models struggle with extended planning, marking a key area for improvement as AI systems take on more sophisticated real-world tasks.
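The backtrack-and-correct behavior described above can be pictured as depth-first search over candidate reasoning steps: commit to a step, explore its consequences, and retreat to an earlier choice when a branch dead-ends. A toy sketch, with an invented problem structure purely for illustration:

```python
from typing import Callable, Dict, List, Optional

def solve(
    state: str,
    candidates: Dict[str, List[str]],  # maps a state to candidate next steps
    is_goal: Callable[[str], bool],
    path: Optional[List[str]] = None,
) -> Optional[List[str]]:
    """Depth-first search with backtracking: try each candidate next step,
    recurse, and abandon the choice (backtrack) when a branch dead-ends."""
    path = path or [state]
    if is_goal(state):
        return path
    for nxt in candidates.get(state, []):
        if nxt in path:
            continue  # avoid revisiting a state already on this path
        result = solve(nxt, candidates, is_goal, path + [nxt])
        if result is not None:
            return result
    return None  # dead end; the caller falls through to its next candidate

# A tiny example: the first candidate is a dead end, so the search
# backtracks and succeeds via the "lemma" branch.
graph = {"start": ["dead_end", "lemma"], "lemma": ["answer"]}
print(solve("start", graph, lambda s: s == "answer"))
```

A language model doing long chain-of-thought performs something analogous implicitly in text, which is exactly what makes coherence over very long traces hard to maintain.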
The research community is actively investigating reinforcement learning methods, structural properties of effective reasoning chains, and training approaches to develop these capabilities. LongCoT provides a scalable benchmark for tracking progress in this critical area.
Key Takeaways
- LongCoT benchmark reveals frontier models score under 10% on long-horizon reasoning tasks, with GPT 5.2 at 9.8% and Gemini 3 Pro at 6.1%
- The benchmark comprises 2,500 expert-designed problems requiring tens to hundreds of thousands of reasoning tokens across chemistry, mathematics, computer science, chess, and logic
- Individual reasoning steps are within model capabilities, isolating the challenge as long-horizon planning rather than single-step difficulty
- Long-horizon reasoning is critical for complex autonomous AI deployments, enabling backtracking and error correction
- The benchmark provides a scalable measure to track progress in extended reasoning capabilities as the research community develops new training approaches