Researchers have released OpenDeepThink, a test-time compute scaling framework that improves LLM reasoning through parallel candidate generation and pairwise Bradley-Terry comparison. Published May 14, 2026 on arXiv by Shang Zhou, Wenhao Chai, Kaiyuan Liu, Huanzhi Mao, Qiuyang Mang, and Jingbo Shang, the method raised Gemini 3.1 Pro's Codeforces Elo by 405 points in approximately 27 minutes of compute time.
Bradley-Terry Aggregation Solves the Selection Bottleneck
Most test-time scaling methods extend reasoning depth by generating longer single traces. Scaling breadth instead, by sampling multiple candidates in parallel, faces a critical challenge: selecting the best answer without ground-truth verification. Pointwise LLM judging, where a model scores each candidate in isolation, is noisy and biased.
OpenDeepThink addresses this through pairwise comparison and Bradley-Terry ranking:
- Each generation round, the LLM judges random pairs of candidate solutions
- Votes aggregate into a global ranking via the Bradley-Terry probability model
- Top-ranked candidates are preserved between rounds
- The top 75% of candidates are mutated using natural-language critiques generated during pairwise comparison
- The bottom 25% are discarded
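The round structure above can be sketched in a few dozen lines. This is a minimal illustration, not the authors' implementation: `judge` and `mutate` are hypothetical stand-ins for the LLM calls (pairwise comparison and critique-guided rewriting), and the Bradley-Terry fit uses the standard minorization-maximization (MM) update.

```python
import random

def bradley_terry_rank(n, wins, iters=200):
    """Fit Bradley-Terry strengths from a pairwise win matrix via the
    standard MM update. wins[i][j] = times candidate i beat candidate j."""
    s = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins for candidate i
            # Sum over opponents i actually faced (skip unplayed pairs)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (s[i] + s[j])
                for j in range(n)
                if j != i and wins[i][j] + wins[j][i] > 0
            )
            new.append(w_i / denom if denom > 0 else s[i])
        total = sum(new)
        s = [x * n / total for x in new]  # normalize for numerical stability
    return s

def evolution_round(candidates, judge, mutate, num_pairs=64, keep_frac=0.75):
    """One round, per the description above: judge random pairs, rank via
    Bradley-Terry, preserve the top-ranked candidate, mutate the rest of
    the top 75%, discard the bottom 25%.

    judge(a, b)  -> 0 if a wins, 1 if b wins   (stand-in for an LLM judge)
    mutate(c)    -> revised candidate          (stand-in for critique-guided rewrite)
    """
    n = len(candidates)
    wins = [[0] * n for _ in range(n)]
    for _ in range(num_pairs):
        i, j = random.sample(range(n), 2)
        winner = i if judge(candidates[i], candidates[j]) == 0 else j
        loser = j if winner == i else i
        wins[winner][loser] += 1
    strengths = bradley_terry_rank(n, wins)
    order = sorted(range(n), key=lambda k: strengths[k], reverse=True)
    survivors = order[: max(1, round(keep_frac * n))]
    # Elitism: keep the best candidate verbatim; mutate the other survivors
    return [candidates[survivors[0]]] + [mutate(candidates[k]) for k in survivors[1:]]
```

Running eight such rounds in sequence, with `mutate` feeding each survivor its accumulated critiques, reproduces the overall loop the paper describes.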
This population-based approach completed eight sequential LLM-call rounds in approximately 27 minutes of wall-clock time.
Strong Performance on Competitive Programming
The framework delivered a 405 Elo improvement for Gemini 3.1 Pro on Codeforces problems. This gain is substantial in competitive programming, where even 50-100 Elo improvements typically require significant model upgrades.
The pipeline transfers across model tiers without requiring hyperparameter retuning, suggesting the approach is model-agnostic.
Domain-Specific Results and Dataset Release
On the multi-domain HLE benchmark, OpenDeepThink showed concentrated gains in objectively verifiable domains. In domains with subjective evaluation criteria the effect reversed, indicating the pairwise comparison method works best where correctness criteria are clear.
The researchers released CF-73, a curated dataset of 73 expert-rated Codeforces problems. Each problem carries International Grandmaster-level annotations, and local evaluation agrees with official Codeforces verdicts 99% of the time.
Why Pairwise Comparison Outperforms Pointwise Judging
The Bradley-Terry model ranks items based on pairwise win frequencies rather than absolute scores. This proves more robust because it aggregates multiple comparative judgments instead of relying on potentially biased single-point assessments.
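The core of the model is a single formula: the probability that candidate i beats candidate j is the ratio of i's latent strength to the pair's combined strength. A small sketch (function names are mine, not from the paper) shows the formula and the two-candidate intuition that strengths are just observed head-to-head odds:

```python
def bt_win_prob(s_i: float, s_j: float) -> float:
    """Bradley-Terry model: P(i beats j) = s_i / (s_i + s_j)."""
    return s_i / (s_i + s_j)

def mle_strength_ratio(wins_i: int, total: int) -> float:
    """For two candidates, the maximum-likelihood strength ratio is the
    observed head-to-head odds: w wins out of n gives s_i/s_j = w/(n - w)."""
    return wins_i / (total - wins_i)
```

Because each strength estimate pools every comparison a candidate took part in, a single biased or noisy judgment is averaged out rather than determining the ranking outright, which is the robustness advantage over one-shot pointwise scores.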
The mutation mechanism leverages the natural-language critiques produced during comparison rounds, allowing candidates to improve based on specific identified weaknesses rather than generic refinement prompts.
Key Takeaways
- OpenDeepThink improved Gemini 3.1 Pro's Codeforces Elo by 405 points through pairwise Bradley-Terry comparison
- The framework scales test-time compute in breadth rather than depth, generating and selecting among multiple parallel reasoning traces
- The method transfers across different model strengths without hyperparameter retuning
- Performance gains concentrate in objectively verifiable domains with clear correctness criteria
- The researchers released CF-73, a dataset of 73 expert-annotated competitive programming problems with 99% evaluation agreement