Researchers have released OpenDeepThink, a test-time compute scaling framework that improves LLM reasoning through parallel candidate generation and pairwise Bradley-Terry comparison. Published May 14, 2026 on arXiv by Shang Zhou, Wenhao Chai, Kaiyuan Liu, Huanzhi Mao, Qiuyang Mang, and Jingbo Shang, the method raised Gemini 3.1 Pro's Codeforces Elo by 405 points in approximately 27 minutes of compute time.
Bradley-Terry Aggregation Solves the Selection Bottleneck
Most test-time scaling methods extend reasoning depth by generating longer single traces. Scaling breadth instead, by sampling multiple candidates in parallel, faces a critical challenge: selecting the best answer without ground-truth verification. Pointwise LLM judging, where a model scores each candidate in isolation, is noisy and biased.
OpenDeepThink addresses this through pairwise comparison and Bradley-Terry ranking:
- Each generation round, the LLM judges random pairs of candidate solutions
- Votes aggregate into a global ranking via the Bradley-Terry probability model
- Top-ranked candidates are preserved between rounds
- The top 75% of candidates are mutated using natural-language critiques generated during pairwise comparison
- The bottom 25% are discarded
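The round structure above can be sketched in a few dozen lines. This is a minimal illustration, not the authors' implementation: `judge` and `mutate` are hypothetical stand-ins for the LLM calls (pairwise comparison and critique-guided rewriting), and the Bradley-Terry fit uses the standard minorization-maximization (MM) update.

```python
import random

def bradley_terry_rank(n, wins, iters=200):
    """Fit Bradley-Terry strengths from a pairwise win matrix via the
    standard MM update. wins[i][j] = times candidate i beat candidate j."""
    s = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins for candidate i
            # Sum over opponents i actually faced (skip unplayed pairs)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (s[i] + s[j])
                for j in range(n)
                if j != i and wins[i][j] + wins[j][i] > 0
            )
            new.append(w_i / denom if denom > 0 else s[i])
        total = sum(new)
        s = [x * n / total for x in new]  # normalize for numerical stability
    return s

def evolution_round(candidates, judge, mutate, num_pairs=64, keep_frac=0.75):
    """One round, per the description above: judge random pairs, rank via
    Bradley-Terry, preserve the top-ranked candidate, mutate the rest of
    the top 75%, discard the bottom 25%.

    judge(a, b)  -> 0 if a wins, 1 if b wins   (stand-in for an LLM judge)
    mutate(c)    -> revised candidate          (stand-in for critique-guided rewrite)
    """
    n = len(candidates)
    wins = [[0] * n for _ in range(n)]
    for _ in range(num_pairs):
        i, j = random.sample(range(n), 2)
        winner = i if judge(candidates[i], candidates[j]) == 0 else j
        loser = j if winner == i else i
        wins[winner][loser] += 1
    strengths = bradley_terry_rank(n, wins)
    order = sorted(range(n), key=lambda k: strengths[k], reverse=True)
    survivors = order[: max(1, round(keep_frac * n))]
    # Elitism: keep the best candidate verbatim; mutate the other survivors
    return [candidates[survivors[0]]] + [mutate(candidates[k]) for k in survivors[1:]]
```

Running eight such rounds in sequence, with `mutate` feeding each survivor its accumulated critiques, reproduces the overall loop the paper describes.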
This population-based approach completed eight sequential LLM-call rounds in approximately 27 minutes of wall-clock time.
Strong Performance on Competitive Programming
The framework delivered a 405 Elo improvement for Gemini 3.1 Pro on Codeforces problems. This gain is substantial in competitive programming, where even 50-100 Elo improvements typically require significant model upgrades.
The pipeline transfers across model tiers without requiring hyperparameter retuning, suggesting the approach is model-agnostic.
Domain-Specific Results and Dataset Release
On the multi-domain HLE benchmark, OpenDeepThink showed concentrated gains in objectively verifiable domains. In domains with subjective evaluation criteria the effect reversed, indicating the pairwise comparison method works best where correctness criteria are clear.
The researchers released CF-73, a curated dataset of 73 expert-rated Codeforces problems. Each problem carries International Grandmaster-level annotations, and local evaluation agrees with official Codeforces verdicts 99% of the time.
Why Pairwise Comparison Outperforms Pointwise Judging
The Bradley-Terry model ranks items based on pairwise win frequencies rather than absolute scores. This proves more robust because it aggregates multiple comparative judgments instead of relying on potentially biased single-point assessments.
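The core of the model is a single formula: the probability that candidate i beats candidate j is the ratio of i's latent strength to the pair's combined strength. A small sketch (function names are mine, not from the paper) shows the formula and the two-candidate intuition that strengths are just observed head-to-head odds:

```python
def bt_win_prob(s_i: float, s_j: float) -> float:
    """Bradley-Terry model: P(i beats j) = s_i / (s_i + s_j)."""
    return s_i / (s_i + s_j)

def mle_strength_ratio(wins_i: int, total: int) -> float:
    """For two candidates, the maximum-likelihood strength ratio is the
    observed head-to-head odds: w wins out of n gives s_i/s_j = w/(n - w)."""
    return wins_i / (total - wins_i)
```

Because each strength estimate pools every comparison a candidate took part in, a single biased or noisy judgment is averaged out rather than determining the ranking outright, which is the robustness advantage over one-shot pointwise scores.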
The mutation mechanism leverages the natural-language critiques produced during comparison rounds, allowing candidates to improve based on specific identified weaknesses rather than generic refinement prompts.
Key Takeaways
- OpenDeepThink improved Gemini 3.1 Pro's Codeforces Elo by 405 points through pairwise Bradley-Terry comparison
- The framework scales test-time compute in breadth rather than depth, generating and selecting among multiple parallel reasoning traces
- The method transfers across different model strengths without hyperparameter retuning
- Performance gains concentrate in objectively verifiable domains with clear correctness criteria
- The researchers released CF-73, a dataset of 73 expert-annotated competitive programming problems with 99% evaluation agreement