Researchers analyzing approximately 89,000 pairwise human-feedback comparisons across 116 languages and 52 LLMs found that global leaderboard rankings are fundamentally misleading. Nearly two-thirds of decisive votes cancel out when aggregated into a global ranking, and even the top 50 models are statistically indistinguishable, with pairwise win probabilities of at most 0.53.
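To see how small a 0.53 win probability really is, it helps to translate it into a rating gap. The sketch below uses the standard Elo formula (a base-10 logistic, equivalent to Bradley-Terry up to rescaling); the function names and numbers are illustrative, not from the paper.

```python
import math

def elo_win_prob(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** (-(r_a - r_b) / 400.0))

def elo_gap_for_prob(p: float) -> float:
    """Rating gap that gives the stronger model win probability p."""
    return 400.0 * math.log10(p / (1.0 - p))

# A 0.53 head-to-head win probability corresponds to a gap of only
# about 21 Elo points -- well within typical leaderboard noise.
gap = elo_gap_for_prob(0.53)
print(round(gap, 1))  # ~20.9
```

In other words, the score differences separating the top 50 models are tiny relative to the uncertainty in their estimates, which is why the paper calls them statistically indistinguishable.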
Language Heterogeneity Drives Ranking Inconsistencies
The study, published on arXiv on May 7, 2026, traces the failure of global rankings to "strong, structured heterogeneity of opinions across language, task, and time." Language plays the key role: grouping votes by language or language family dramatically increases agreement, producing "two orders of magnitude higher spread in the ELO scores."
What appears as global noise is actually "a mixture of coherent but conflicting subpopulations." The researchers found that creating separate rankings by language produces far more consistent and meaningful results than attempting to create a single global leaderboard.
New Portfolio Approach Covers 96% of Votes
To address this heterogeneity, the researchers introduced (λ, ν)-portfolios: small sets of models that achieve prediction error at most λ while covering at least a ν fraction of users. Their algorithms recovered just 5 distinct Bradley-Terry rankings that together cover over 96% of votes at a modest λ, versus 21% coverage for the single global ranking.
The researchers also constructed a portfolio of 6 LLMs covering twice as many votes as the top-6 LLMs from a global ranking. This demonstrates that a small set of models selected to account for heterogeneity can serve users far more effectively than the highest-ranked models on a global leaderboard.
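The portfolio idea can be pictured as a coverage problem: keep adding models until enough users are served well. The paper's actual algorithms are not detailed in this article, so the following is only a generic greedy-set-cover sketch with a hypothetical error matrix, where a user counts as covered if some portfolio model serves them with error at most λ.

```python
def greedy_portfolio(errors: dict[str, dict[str, float]],
                     lam: float, nu: float) -> list[str]:
    """Greedily pick models until >= nu of users have error <= lam.

    errors[model][user] = prediction error of `model` for `user`.
    """
    users = {u for per_user in errors.values() for u in per_user}
    target = nu * len(users)
    covered: set[str] = set()
    portfolio: list[str] = []
    while len(covered) < target:
        # Model covering the most not-yet-covered users within error lam.
        best = max(errors, key=lambda m: len(
            {u for u, e in errors[m].items() if e <= lam} - covered))
        gain = {u for u, e in errors[best].items() if e <= lam} - covered
        if not gain:
            break  # remaining users cannot be covered at this lam
        portfolio.append(best)
        covered |= gain
    return portfolio

# Toy example: three models, four users (illustrative numbers only).
errs = {
    "model_a": {"u1": 0.1, "u2": 0.9, "u3": 0.1, "u4": 0.9},
    "model_b": {"u1": 0.9, "u2": 0.1, "u3": 0.9, "u4": 0.9},
    "model_c": {"u1": 0.9, "u2": 0.9, "u3": 0.9, "u4": 0.1},
}
print(greedy_portfolio(errs, lam=0.2, nu=1.0))
```

The toy run needs all three models to cover everyone, while a "global top-1" choice (model_a, which wins most head-to-heads here) serves only half the users well, mirroring the paper's finding that a small heterogeneity-aware portfolio beats the top of a global leaderboard.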
Implications for LLM Evaluation and Selection
The findings have significant implications for how organizations and researchers should approach LLM evaluation and selection. The paper suggests that global leaderboards, such as those built on Chatbot Arena data, may mislead users into selecting models that perform poorly for their specific language, task, or use case.
The research was conducted by Jai Moondra, Ayela Chughtai, Bhargavi Lanka, and Swati Gupta, and published as "Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML" on arXiv.
Key Takeaways
- Analysis of ~89,000 pairwise comparisons across 116 languages and 52 LLMs reveals that nearly two-thirds of decisive votes cancel out in global rankings
- Even the top 50 models according to global Bradley-Terry rankings are statistically indistinguishable, with pairwise win probabilities at most 0.53
- Language heterogeneity is the primary driver, with grouping by language producing two orders of magnitude higher spread in ELO scores than global rankings
- Researchers' portfolio approach using just 5 distinct rankings covers over 96% of votes, compared to 21% coverage by the global ranking
- A portfolio of 6 carefully selected LLMs can cover twice as many votes as the top-6 LLMs from a global ranking, suggesting fundamental flaws in current leaderboard methodologies