As language models proliferate, it becomes increasingly difficult to track their performance across all benchmarks. This is why leaderboards have emerged: platforms that compile the results of many models on a selection of tests and update the scores as they are published. They act as a showcase for research: laboratories publish their advances, while engineers and decision-makers can consult consolidated results to choose a model. Leaderboards also provide transparency, revealing which model dominates on which tasks and inviting scrutiny of how meaningful the gaps between models really are.
Several leaderboards stand out for their approach and the metrics they highlight. Here’s a roundup of the most recognized platforms in 2026:
| Leaderboard | Special features | Types of tasks and metrics used |
| --- | --- | --- |
| Vellum | Emphasizes recent tests and removes saturated benchmarks. Ranks models by overall score, but also provides details by task. | Around fifteen tests (reasoning, math, code). Average score, rank per category, usage cost. |
| LLM-Stats | Open source project favoring open models. Each result is accompanied by information on model size and reproducibility. | Benchmarks for comprehension (MMLU, ARC), code (HumanEval) and summarization. Standard metrics such as accuracy and pass@1 (sketched below the table). |
| LiveBench | Dynamic leaderboard that regularly re-runs tests on models and updates scores in real time. | Mix of classic tests and new automatically generated sets to detect regressions. Also measures latency and cost. |
| SEAL | Academic initiative with “super-benchmarks” combining several test sets into a single score. | Provides a unified score (SuperScore) based on a mix of MMLU, TruthfulQA, HellaSwag, etc. Also provides weightings by category. |
| Chatbot Arena | Community platform where users directly compare two models in real-life situations (online chat). | Results based on thousands of anonymous duels rated by Internet users. Establishes an Elo ranking reflecting user preference. |
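The pass@1 metric mentioned in the table (and its generalization pass@k) is worth illustrating. The sketch below uses the standard unbiased pass@k estimator, computed per problem from the number of generations `n` and the number of correct ones `c`; the figures in the example are invented, not taken from any leaderboard.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn from n generations passes, given c correct generations."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 generations for one problem, 47 of them pass the unit tests
print(pass_at_k(n=200, c=47, k=1))   # pass@1 ≈ 0.235
print(pass_at_k(n=200, c=47, k=10))  # pass@10, higher since more tries count
```

A benchmark score is then the average of this value over all problems in the test set.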
Each of these platforms offers specific features. Vellum, for example, highlights the best-performing models on the latest benchmark versions and removes tests that have become too easy or contaminated. LLM-Stats, with its open-source orientation, makes results reproducible locally. LiveBench measures not only accuracy but also inference speed and cost, crucial factors for industrialization. SEAL seeks to summarize performance in a single index to simplify comparisons. Finally, Chatbot Arena stands out for its participative approach: users themselves decide which model they prefer by comparing the contenders in blind duels.
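To make the duel-based ranking concrete, here is a simplified sketch of a classic Elo update after a single blind duel. It illustrates the general rating scheme rather than Chatbot Arena's exact methodology, and the starting ratings and K-factor are arbitrary choices for the example.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Win probability of model A over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, outcome: float, k: float = 32.0):
    """outcome = 1.0 if A wins the duel, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (outcome - e_a), r_b + k * ((1.0 - outcome) - (1.0 - e_a))

# Two models start at 1000; model A wins one blind duel
print(elo_update(1000, 1000, outcome=1.0))  # -> (1016.0, 984.0)
```

Aggregated over thousands of votes, these updates converge toward a ranking that reflects which model users prefer in practice.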
Models are often ranked according to an aggregate score, calculated as the average (or a weighted combination) of results on different benchmarks. However, this average sometimes masks significant disparities. For example, a model may score 95% on mathematical questions, but 70% on general knowledge. Depending on the intended use, a balanced model may be preferable to a niche champion. What's more, some leaderboards normalize scores to account for model size or cost, while others consider only raw accuracy.
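As an illustration of this aggregation, the sketch below computes both a plain average and a normalized weighted mean. The benchmark names, scores and weights are invented for the example, not real leaderboard figures.

```python
from typing import Optional

# Illustrative per-benchmark scores and weights (made-up numbers)
scores  = {"MMLU": 0.82, "HumanEval": 0.70, "GSM8K": 0.95, "TruthfulQA": 0.64}
weights = {"MMLU": 0.3,  "HumanEval": 0.3,  "GSM8K": 0.2,  "TruthfulQA": 0.2}

def aggregate(scores: dict, weights: Optional[dict] = None) -> float:
    """Plain average if no weights are given, otherwise a normalized weighted mean."""
    if weights is None:
        return sum(scores.values()) / len(scores)
    total = sum(weights[b] for b in scores)
    return sum(scores[b] * weights[b] for b in scores) / total

print(round(aggregate(scores), 3))           # unweighted mean -> 0.778
print(round(aggregate(scores, weights), 3))  # weighted mean   -> 0.774
```

The choice of weights is exactly where leaderboards diverge: two platforms using the same raw results can still rank models differently.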
In addition to accuracy percentages, new indicators appear on the ranking tables:

- **Usage cost**, typically the price of inference for a given volume of tokens.
- **Latency and generation speed**, measured by platforms such as LiveBench.
- **Model size and openness**, highlighted by open-source-oriented trackers like LLM-Stats.
- **Human preference scores**, such as the Elo ranking derived from Chatbot Arena duels.
While leaderboards are handy, there are a few things to keep in mind:

- An aggregate score can hide large gaps between task categories.
- Benchmarks saturate or leak into training data over time, which is why some platforms retire contaminated tests.
- Methodologies differ from one leaderboard to another (test selection, weighting, normalization), so rankings do not always agree.
- Community rankings such as Chatbot Arena measure user preference, not factual correctness.
To make the most of these rankings:

- Check the methodology behind each leaderboard: which benchmarks, which weights, how often scores are refreshed.
- Look at per-task detail rather than the overall score alone.
- Weigh cost and latency alongside accuracy if the model is intended for production use.
- Confirm shortlisted models with your own tests on your own use cases, as sketched below.
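For that last point, a tiny harness for spot-checking a shortlisted model on your own prompts might look like the following sketch. Here `generate_fn` stands for whatever client wraps your model (API call, local inference), and substring matching is only one possible scoring rule.

```python
from typing import Callable, List, Tuple

def spot_check(generate_fn: Callable[[str], str],
               cases: List[Tuple[str, str]]) -> float:
    """Run your own prompts through a model and score by substring match.

    generate_fn: your client wrapper, mapping a prompt to the model's answer.
    cases: (prompt, expected answer) pairs drawn from your real use cases.
    """
    hits = sum(
        expected.strip().lower() in generate_fn(prompt).lower()
        for prompt, expected in cases
    )
    return hits / len(cases)

# Example with a stub "model" that always answers "Paris"
cases = [("Capital of France?", "Paris"), ("Capital of Italy?", "Rome")]
print(spot_check(lambda prompt: "Paris", cases))  # -> 0.5
```

Even a few dozen such cases, drawn from real workloads, often reveal more about fitness for purpose than a one-point difference on a public benchmark.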
Leaderboards play an essential role in tracking the rapid evolution of large language models. They summarize hundreds of results, making comparison and technology monitoring easier. However, a discerning user needs to keep a critical eye: understand the methodology behind each ranking, analyze the detailed scores and complement the evaluation with tests of their own. By combining these sources, it is possible to make an informed choice of model, taking into account accuracy, cost, speed and suitability for the intended use cases.