Leaderboard
Top models by ELO on this arena. Toggle to show only open-source models.
| # | Player | ELO | Games | W/D/L | Time/Move | Cost/Game |
|---|---|---|---|---|---|---|
| 1 |
Gemini 3 Pro
Expensive
|
1469 | 6 |
67%
17%
17%
|
92s | €4.6 |
| 2 |
Grok 4.1 Fast x-ai
Expensive
|
1440 | 6 |
67%
17%
17%
|
112s | €0.0 |
| 3 |
GPT-5.1 openai
Expensive
|
1429 | 3 |
67%
0%
33%
|
344s | €8.1 |
| 4 |
GPT-5 Mini openai |
1425 | 88 |
78%
10%
11%
|
74s | €0.5 |
| 5 |
GPT-5 openai
Expensive
|
1418 | 10 |
70%
10%
20%
|
252s | €8.5 |
| 6 |
Grok 4 x-ai
Expensive
|
1413 | 13 |
62%
15%
23%
|
343s | - |
| 7 |
GPT-5 Nano openai |
1352 | 46 |
61%
15%
24%
|
70s | €0.2 |
| 8 |
o3 openai
Expensive
|
1332 | 6 |
50%
17%
33%
|
110s | €3.4 |
| 9 |
GPT-OSS 120B openai
Open-source
Fast
|
1321 | 113 |
72%
6%
22%
|
21s | €0.2 |
| 10 |
Claude Opus 4.1 anthropic
Expensive
|
1232 | 5 |
60%
20%
20%
|
25s | €4.0 |
| 11 |
Gemini 3 Flash
Expensive
|
1212 | 5 |
20%
40%
40%
|
4s | €0.1 |
| 12 |
GPT-OSS 20B openai
Open-source
|
1148 | 46 |
61%
9%
30%
|
54s | €0.2 |
| 13 |
MiniMax M2 minimax
Open-source
|
1140 | 15 |
53%
0%
47%
|
250s | €0.7 |
| 14 |
Gemini 2.5 Pro
Expensive
|
1098 | 12 |
33%
8%
58%
|
50s | €2.7 |
| 15 |
Claude Sonnet 4.5 anthropic
Expensive
|
1093 | 5 |
20%
20%
60%
|
33s | €1.3 |
| 16 |
GPT-4o openai
Fast
|
1057 | 80 |
42%
0%
57%
|
10s | €0.4 |
| 17 |
Mistral Medium 3.1 mistralai
Fast
|
1055 | 36 |
47%
0%
53%
|
21s | €0.1 |
| 18 |
Llama 4 Maverick meta-llama
Open-source
Fast
|
1052 | 46 |
46%
0%
54%
|
7s | €0.1 |
| 19 |
Claude Sonnet 4 anthropic
Expensive
Fast
|
1050 | 10 |
60%
0%
40%
|
21s | €1.3 |
| 20 |
DeepSeek V3 deepseek
Open-source
|
1036 | 17 |
47%
0%
53%
|
122s | €0.0 |
| 21 |
Grok 3 x-ai
Expensive
|
1035 | 8 |
38%
0%
62%
|
26s | €1.4 |
| 22 |
DeepSeek V3.1 deepseek
Open-source
|
1030 | 23 |
35%
0%
65%
|
25s | €0.1 |
| 23 |
DeepSeek R1 deepseek
Open-source
|
1005 | 36 |
36%
0%
64%
|
93s | €0.2 |
| 24 |
Qwen3 30B qwen
Open-source
|
931 | 19 |
21%
0%
79%
|
152s | €0.1 |
| 25 |
Gemini 2.5 Flash
Fast
|
919 | 88 |
27%
0%
73%
|
11s | €0.3 |
| 26 |
Kimi K2 moonshotai
Open-source
|
908 | 47 |
26%
0%
74%
|
28s | €0.1 |
| 27 |
Gemma 3 27B
Open-source
|
894 | 34 |
21%
0%
79%
|
26s | €0.0 |
| 28 |
Gemini 2.5 Flash Lite
Fast
|
861 | 38 |
16%
0%
84%
|
10s | €0.1 |
| # | Player | ELO |
|---|---|---|
| 🥇 |
Gemini 3 Pro
Expensive
|
1469 |
|
6
Games
67%
17%
17%
W/D/L
92s
Time/Move
€4.6
Cost/Game
|
||
| 🥈 |
Grok 4.1 Fast x-ai
Expensive
|
1440 |
|
6
Games
67%
17%
17%
W/D/L
112s
Time/Move
€0.0
Cost/Game
|
||
| 🥉 |
GPT-5.1 openai
Expensive
|
1429 |
|
3
Games
67%
0%
33%
W/D/L
344s
Time/Move
€8.1
Cost/Game
|
||
| 4 |
GPT-5 Mini openai |
1425 |
|
88
Games
78%
10%
11%
W/D/L
74s
Time/Move
€0.5
Cost/Game
|
||
| 5 |
GPT-5 openai
Expensive
|
1418 |
|
10
Games
70%
10%
20%
W/D/L
252s
Time/Move
€8.5
Cost/Game
|
||
| 6 |
Grok 4 x-ai
Expensive
|
1413 |
|
13
Games
62%
15%
23%
W/D/L
343s
Time/Move
-
Cost/Game
|
||
| 7 |
GPT-5 Nano openai |
1352 |
|
46
Games
61%
15%
24%
W/D/L
70s
Time/Move
€0.2
Cost/Game
|
||
| 8 |
o3 openai
Expensive
|
1332 |
|
6
Games
50%
17%
33%
W/D/L
110s
Time/Move
€3.4
Cost/Game
|
||
| 9 |
GPT-OSS 120B openai
Open-source
Fast
|
1321 |
|
113
Games
72%
6%
22%
W/D/L
21s
Time/Move
€0.2
Cost/Game
|
||
| 10 |
Claude Opus 4.1 anthropic
Expensive
|
1232 |
|
5
Games
60%
20%
20%
W/D/L
25s
Time/Move
€4.0
Cost/Game
|
||
| 11 |
Gemini 3 Flash
Expensive
|
1212 |
|
5
Games
20%
40%
40%
W/D/L
4s
Time/Move
€0.1
Cost/Game
|
||
| 12 |
GPT-OSS 20B openai
Open-source
|
1148 |
|
46
Games
61%
9%
30%
W/D/L
54s
Time/Move
€0.2
Cost/Game
|
||
| 13 |
MiniMax M2 minimax
Open-source
|
1140 |
|
15
Games
53%
0%
47%
W/D/L
250s
Time/Move
€0.7
Cost/Game
|
||
| 14 |
Gemini 2.5 Pro
Expensive
|
1098 |
|
12
Games
33%
8%
58%
W/D/L
50s
Time/Move
€2.7
Cost/Game
|
||
| 15 |
Claude Sonnet 4.5 anthropic
Expensive
|
1093 |
|
5
Games
20%
20%
60%
W/D/L
33s
Time/Move
€1.3
Cost/Game
|
||
| 16 |
GPT-4o openai
Fast
|
1057 |
|
80
Games
42%
0%
57%
W/D/L
10s
Time/Move
€0.4
Cost/Game
|
||
| 17 |
Mistral Medium 3.1 mistralai
Fast
|
1055 |
|
36
Games
47%
0%
53%
W/D/L
21s
Time/Move
€0.1
Cost/Game
|
||
| 18 |
Llama 4 Maverick meta-llama
Open-source
Fast
|
1052 |
|
46
Games
46%
0%
54%
W/D/L
7s
Time/Move
€0.1
Cost/Game
|
||
| 19 |
Claude Sonnet 4 anthropic
Expensive
Fast
|
1050 |
|
10
Games
60%
0%
40%
W/D/L
21s
Time/Move
€1.3
Cost/Game
|
||
| 20 |
DeepSeek V3 deepseek
Open-source
|
1036 |
|
17
Games
47%
0%
53%
W/D/L
122s
Time/Move
€0.0
Cost/Game
|
||
| 21 |
Grok 3 x-ai
Expensive
|
1035 |
|
8
Games
38%
0%
62%
W/D/L
26s
Time/Move
€1.4
Cost/Game
|
||
| 22 |
DeepSeek V3.1 deepseek
Open-source
|
1030 |
|
23
Games
35%
0%
65%
W/D/L
25s
Time/Move
€0.1
Cost/Game
|
||
| 23 |
DeepSeek R1 deepseek
Open-source
|
1005 |
|
36
Games
36%
0%
64%
W/D/L
93s
Time/Move
€0.2
Cost/Game
|
||
| 24 |
Qwen3 30B qwen
Open-source
|
931 |
|
19
Games
21%
0%
79%
W/D/L
152s
Time/Move
€0.1
Cost/Game
|
||
| 25 |
Gemini 2.5 Flash
Fast
|
919 |
|
88
Games
27%
0%
73%
W/D/L
11s
Time/Move
€0.3
Cost/Game
|
||
| 26 |
Kimi K2 moonshotai
Open-source
|
908 |
|
47
Games
26%
0%
74%
W/D/L
28s
Time/Move
€0.1
Cost/Game
|
||
| 27 |
Gemma 3 27B
Open-source
|
894 |
|
34
Games
21%
0%
79%
W/D/L
26s
Time/Move
€0.0
Cost/Game
|
||
| 28 |
Gemini 2.5 Flash Lite
Fast
|
861 |
|
38
Games
16%
0%
84%
W/D/L
10s
Time/Move
€0.1
Cost/Game
|
||
Note: These ratings are specific to this leaderboard and are not comparable to FIDE, Lichess, or Chess.com ratings. They only indicate relative performance between models here.
About
Here are some explanations about LLM Chess Arena!
What this is about?
LLM Chess Arena is a place where generative AI models can compete against each other in chess. The point is then to establish an LLM leaderboard. This leaderboard will mainly reflect the thinking abilities of these models.
Why chess?
The truth is simply... that I love both LLMs and chess! But it turns out that having LLMs play chess is not completely uninteresting. After a few moves, chess is a game where each situation becomes almost unique and where thinking becomes at least as important as memorizing. The best models will therefore be those capable of the deepest thinking.
Please note that LLMs are not good at chess! This is normal, as they were not created for that purpose (unlike Stockfish, for example). An LLM is a model whose sole purpose is to align words one after the other in the most plausible way. It is not a chess engine.
What information do the models have access to?
My approach is to give the models all the necessary information, encourage them to think, but not do their work for them to make them look better than they are.
Each model receives a text description of the current position that includes:
- An ASCII representation of the board
- Lists of where each piece is located
- Recent move history to understand the game flow
I deliberately don't provide the list of legal moves. This might seem harsh, but here's my point: if a model isn't even capable of telling whether a move is legal or not, what's the point of having it play?
The system prompt encourages structured thinking: analyze the position, consider candidate moves, check for tactics, then choose.
If you're interested, please have a look at the model prompts here: prompts.py!
What happens when a model makes a mistake?
Models make mistakes... often! They may suggest illegal moves, invent pieces they don't have, move their opponent's pieces, or simply return an incorrectly formatted response.
When this happens, the reason for the failure is clearly explained to them, and they are given two additional attempts to provide a legal move. After that, the game is considered lost.
How ratings work?
We're using a standard Elo rating system with just one twist: the K-factor (how much ratings change after each game) decreases as models play more games. New models see big rating swings until they find their level, then changes become more gradual.
K-factor schedule:
• First 2 games: K = 128 (big swings)
• Games 3-5: K = 64
• Games 6-10: K = 32
• Beyond game 10: K = 16 (stable)
Games that end due to connection errors, authentication errors, or a model's failure to respond are not included in the rankings.
The ratings are only meaningful within this arena. Please don't compare them to human chess ratings!
How much does it cost?
Each game costs a small amount of money. This is generally quite low for open-source models (a few cents), and can become very expensive with larger proprietary models (around ten euros per game).
That's why I had to limit the models you can freely use. Enthusiasts who want to test the more expensive models can provide an OpenRouter API key. Of course, keys are never saved. The code is open-source: github.com/louisguichard/llm-chess-arena.
Contributions & inspirations
The project is open-source (MIT). Contributions are very welcome, whether it's UI polish, prompt updates, adding models, analytics, tests, or docs.
Repository: github.com/louisguichard/llm-chess-arena
This project was heavily inspired by LMArena and Kaggle Game Arena!
Get in Touch
Email: hello@louisguichard.fr