As language models proliferate, it becomes increasingly difficult to track their performance across all benchmarks. This is why leaderboards have emerged: platforms that compile the results of many models on a selection of tests and update the scores as they are published. They act as a showcase for research: laboratories publish their advances, while engineers and decision-makers can consult consolidated results to choose a model. Leaderboards also provide transparency, revealing which model dominates on which tasks and inviting scrutiny of how meaningful the gaps between models really are.
Several leaderboards stand out for their approach and the metrics they highlight. Here’s a roundup of the most recognized platforms in 2026:
| Leaderboard | Special features | Types of tasks and metrics used |
| --- | --- | --- |
| Vellum | Emphasizes recent tests and removes saturated benchmarks. Ranks models by overall score, but also provides details by task. | Around fifteen tests (reasoning, math, code). Average score, rank per category, usage cost. |
| LLM-Stats | Open source project favoring open models. Each result is accompanied by information on model size and reproducibility. | Benchmarks for comprehension (MMLU, ARC), code (HumanEval) and summarization. Standard metrics such as accuracy and pass@1 (sketched below the table). |
| LiveBench | Dynamic leaderboard that regularly re-runs tests on models and updates scores in real time. | Mix of classic tests and new automatically generated sets to detect regressions. Also measures latency and cost. |
| SEAL | Academic initiative with “super-benchmarks” combining several test sets into a single score. | Provides a unified score (SuperScore) based on a mix of MMLU, TruthfulQA, HellaSwag, etc. Also provides weightings by category. |
| Chatbot Arena | Community platform where users directly compare two models in real-life situations (online chat). | Results based on thousands of anonymous duels rated by Internet users. Establishes an Elo ranking reflecting user preference. |
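The pass@1 metric mentioned in the table (and its generalization pass@k) is worth illustrating. The sketch below uses the standard unbiased pass@k estimator, computed per problem from the number of generations `n` and the number of correct ones `c`; the figures in the example are invented, not taken from any leaderboard.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn from n generations passes, given c correct generations."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 generations for one problem, 47 of them pass the unit tests
print(pass_at_k(n=200, c=47, k=1))   # pass@1 ≈ 0.235
print(pass_at_k(n=200, c=47, k=10))  # pass@10, higher since more tries count
```

A benchmark score is then the average of this value over all problems in the test set.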
Each of these platforms offers specific features. Vellum, for example, highlights the best-performing models on the latest benchmark versions and removes tests that have become too easy or contaminated. LLM-Stats, with its open-source orientation, makes results reproducible locally. LiveBench measures not only accuracy but also inference speed and cost, crucial factors for industrialization. SEAL seeks to summarize performance in a single index to simplify comparisons. Finally, Chatbot Arena stands out for its participative approach: users themselves decide which model they prefer by comparing the contenders in blind duels.
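To make the duel-based ranking concrete, here is a simplified sketch of a classic Elo update after a single blind duel. It illustrates the general rating scheme rather than Chatbot Arena's exact methodology, and the starting ratings and K-factor are arbitrary choices for the example.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Win probability of model A over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, outcome: float, k: float = 32.0):
    """outcome = 1.0 if A wins the duel, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (outcome - e_a), r_b + k * ((1.0 - outcome) - (1.0 - e_a))

# Two models start at 1000; model A wins one blind duel
print(elo_update(1000, 1000, outcome=1.0))  # -> (1016.0, 984.0)
```

Aggregated over thousands of votes, these updates converge toward a ranking that reflects which model users prefer in practice.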
Models are often ranked according to an aggregate score, calculated as the average (or a weighted combination) of results on different benchmarks. However, this average sometimes masks significant disparities. For example, a model may score 95% on mathematical questions, but 70% on general knowledge. Depending on the intended use, a balanced model may be preferable to a niche champion. What's more, some leaderboards normalize scores to account for model size or cost, while others consider only raw accuracy.
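As an illustration of this aggregation, the sketch below computes both a plain average and a normalized weighted mean. The benchmark names, scores and weights are invented for the example, not real leaderboard figures.

```python
from typing import Optional

# Illustrative per-benchmark scores and weights (made-up numbers)
scores  = {"MMLU": 0.82, "HumanEval": 0.70, "GSM8K": 0.95, "TruthfulQA": 0.64}
weights = {"MMLU": 0.3,  "HumanEval": 0.3,  "GSM8K": 0.2,  "TruthfulQA": 0.2}

def aggregate(scores: dict, weights: Optional[dict] = None) -> float:
    """Plain average if no weights are given, otherwise a normalized weighted mean."""
    if weights is None:
        return sum(scores.values()) / len(scores)
    total = sum(weights[b] for b in scores)
    return sum(scores[b] * weights[b] for b in scores) / total

print(round(aggregate(scores), 3))           # unweighted mean -> 0.778
print(round(aggregate(scores, weights), 3))  # weighted mean   -> 0.774
```

The choice of weights is exactly where leaderboards diverge: two platforms using the same raw results can still rank models differently.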
In addition to accuracy percentages, new indicators appear on the ranking tables:

- **Usage cost**, typically the price of inference for a given volume of tokens.
- **Latency and generation speed**, measured by platforms such as LiveBench.
- **Model size and openness**, highlighted by open-source-oriented trackers like LLM-Stats.
- **Human preference scores**, such as the Elo ranking derived from Chatbot Arena duels.
While leaderboards are handy, there are a few things to keep in mind:

- An aggregate score can hide large gaps between task categories.
- Benchmarks saturate or leak into training data over time, which is why some platforms retire contaminated tests.
- Methodologies differ from one leaderboard to another (test selection, weighting, normalization), so rankings do not always agree.
- Community rankings such as Chatbot Arena measure user preference, not factual correctness.
To make the most of these rankings:

- Check the methodology behind each leaderboard: which benchmarks, which weights, how often scores are refreshed.
- Look at per-task detail rather than the overall score alone.
- Weigh cost and latency alongside accuracy if the model is intended for production use.
- Confirm shortlisted models with your own tests on your own use cases, as sketched below.
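For that last point, a tiny harness for spot-checking a shortlisted model on your own prompts might look like the following sketch. Here `generate_fn` stands for whatever client wraps your model (API call, local inference), and substring matching is only one possible scoring rule.

```python
from typing import Callable, List, Tuple

def spot_check(generate_fn: Callable[[str], str],
               cases: List[Tuple[str, str]]) -> float:
    """Run your own prompts through a model and score by substring match.

    generate_fn: your client wrapper, mapping a prompt to the model's answer.
    cases: (prompt, expected answer) pairs drawn from your real use cases.
    """
    hits = sum(
        expected.strip().lower() in generate_fn(prompt).lower()
        for prompt, expected in cases
    )
    return hits / len(cases)

# Example with a stub "model" that always answers "Paris"
cases = [("Capital of France?", "Paris"), ("Capital of Italy?", "Rome")]
print(spot_check(lambda prompt: "Paris", cases))  # -> 0.5
```

Even a few dozen such cases, drawn from real workloads, often reveal more about fitness for purpose than a one-point difference on a public benchmark.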
Leaderboards play an essential role in tracking the rapid evolution of large language models. They summarize hundreds of results, making comparison and technology monitoring easier. However, a discerning user needs to keep a critical eye: understand the methodology behind each ranking, analyze the detailed scores and complement the evaluation with tests of their own. By combining these sources, it is possible to make an informed choice of model, taking into account accuracy, cost, speed and suitability for the intended use cases.