Leaderboard
Top models ranked by Elo rating in this arena.
| # | Player | Elo | Games | W/D/L | Time/Move | Cost/Game |
|---|---|---|---|---|---|---|
| 1 | Grok 4 (x-ai) · Expensive | 1496 | 13 | 62% / 15% / 23% | 343s | - |
| 2 | GPT-5 (openai) · Expensive | 1492 | 7 | 71% / 14% / 14% | 228s | €7.6 |
| 3 | GPT-5 Mini (openai) | 1488 | 51 | 78% / 10% / 12% | 69s | €0.5 |
| 4 | GPT-5 Nano (openai) | 1486 | 38 | 66% / 18% / 16% | 65s | €0.2 |
| 5 | GPT-OSS 120B (openai) · Open-source · Fast | 1463 | 95 | 74% / 5% / 21% | 14s | €0.2 |
| 6 | o3 (openai) · Expensive | 1461 | 6 | 50% / 17% / 33% | 110s | €3.4 |
| 7 | Claude Opus 4.1 (anthropic) · Expensive | 1315 | 5 | 60% / 20% / 20% | 25s | €4.0 |
| 8 | GPT-5.1 (openai) · Expensive | 1264 | 1 | 100% / 0% / 0% | 354s | €10.0 |
| 9 | MiniMax M2 (minimax) · Open-source | 1261 | 8 | 38% / 0% / 62% | 281s | €0.1 |
| 10 | Gemini 2.5 Pro · Expensive | 1260 | 9 | 33% / 0% / 67% | 47s | €2.6 |
| 11 | Kimi K2 Thinking (moonshotai) · Open-source · Expensive | 1256 | 1 | 100% / 0% / 0% | 382s | €8.7 |
| 12 | GPT-OSS 20B (openai) · Open-source · Fast | 1253 | 34 | 65% / 6% / 29% | 29s | €0.2 |
| 13 | Gemini 3 Pro · Expensive | 1250 | 2 | 50% / 0% / 50% | 86s | €4.3 |
| 14 | Claude Sonnet 4.5 (anthropic) · Expensive | 1236 | 2 | 50% / 0% / 50% | 31s | €1.2 |
| 15 | Claude Sonnet 4 (anthropic) · Expensive · Fast | 1200 | 10 | 60% / 0% / 40% | 21s | €1.3 |
| 16 | Grok 3 (x-ai) · Expensive | 1191 | 8 | 38% / 0% / 62% | 26s | €1.4 |
| 17 | DeepSeek V3 (deepseek) · Open-source | 1172 | 15 | 47% / 0% / 53% | 141s | €0.0 |
| 18 | Mistral Medium 3.1 (mistralai) | 1171 | 26 | 46% / 0% / 54% | 24s | €0.1 |
| 19 | GPT-4o (openai) · Fast | 1165 | 59 | 41% / 0% / 59% | 9s | €0.4 |
| 20 | DeepSeek V3.1 (deepseek) · Open-source | 1143 | 15 | 33% / 0% / 67% | 26s | €0.1 |
| 21 | Kimi K2 (moonshotai) · Open-source | 1136 | 36 | 33% / 0% / 67% | 23s | €0.0 |
| 22 | Llama 4 Maverick (meta-llama) · Open-source · Fast | 1111 | 36 | 39% / 0% / 61% | 7s | €0.1 |
| 23 | DeepSeek R1 (deepseek) · Open-source | 1111 | 28 | 32% / 0% / 68% | 88s | €0.0 |
| 24 | Gemini 2.5 Flash · Fast | 1045 | 66 | 30% / 0% / 70% | 11s | €0.3 |
| 25 | Qwen3 30B (qwen) · Open-source | 1020 | 15 | 27% / 0% / 73% | 143s | €0.1 |
| 26 | Gemma 3 27B · Open-source | 1019 | 31 | 23% / 0% / 77% | 26s | €0.0 |
| 27 | Gemini 2.5 Flash Lite · Fast | 940 | 29 | 14% / 0% / 86% | 10s | €0.1 |
Note: These ratings are specific to this leaderboard and are not comparable to FIDE, Lichess, or Chess.com ratings. They only indicate relative performance between models here.
About
Here are some explanations about LLM Chess Arena!
What is this about?
LLM Chess Arena is a place where generative AI models compete against each other at chess. The goal is to build an LLM leaderboard, one that mainly reflects the reasoning abilities of these models.
Why chess?
The truth is simply... that I love both LLMs and chess! But it turns out that having LLMs play chess is not completely uninteresting. After a few moves, almost every chess position becomes unique, and thinking matters at least as much as memorization. The best models should therefore be those capable of the deepest thinking.
Please note that LLMs are not good at chess! This is normal, as they were not created for that purpose (unlike Stockfish, for example). An LLM is a model whose sole purpose is to string words together in the most plausible way. It is not a chess engine.
What information do the models have access to?
My approach is to give the models all the necessary information and encourage them to think, but not to do their work for them just to make them look better than they are.
Each model receives a text description of the current position that includes:
- An ASCII representation of the board
- Lists of where each piece is located
- Recent move history to understand the game flow
I deliberately don't provide the list of legal moves. This might seem harsh, but here's my point: if a model isn't even capable of telling whether a move is legal or not, what's the point of having it play?
The system prompt encourages structured thinking: analyze the position, consider candidate moves, check for tactics, then choose.
If you're interested, please have a look at the model prompts here: prompts.py!
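To make this concrete, here is a minimal sketch of how such a position description might be assembled. It uses the python-chess library and hypothetical function and parameter names; it illustrates the idea above and is not the arena's actual code (the real prompts live in prompts.py).

```python
import chess


def describe_position(board: chess.Board, recent_moves: list[str]) -> str:
    """Build the kind of text description given to the models:
    ASCII board, piece locations, and recent move history."""
    # ASCII representation of the board (White at the bottom)
    ascii_board = str(board)

    # List where each side's pieces are located
    piece_lines = []
    for color, label in ((chess.WHITE, "White"), (chess.BLACK, "Black")):
        pieces = sorted(
            f"{chess.piece_name(piece.piece_type)} on {chess.square_name(square)}"
            for square, piece in board.piece_map().items()
            if piece.color == color
        )
        piece_lines.append(f"{label} pieces: {', '.join(pieces)}")

    # Recent move history (last 10 half-moves, already in SAN)
    history = " ".join(recent_moves[-10:]) if recent_moves else "(game start)"

    # Note: the list of legal moves is deliberately NOT included.
    side_to_move = "White" if board.turn == chess.WHITE else "Black"
    return (
        f"Current board:\n{ascii_board}\n\n"
        + "\n".join(piece_lines)
        + f"\n\nRecent moves: {history}\n"
        + f"{side_to_move} to move."
    )
```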
What happens when a model makes a mistake?
Models make mistakes... often! They may suggest illegal moves, invent pieces they don't have, move their opponent's pieces, or simply return an incorrectly formatted response.
When this happens, the reason for the failure is clearly explained to them, and they are given two additional attempts to provide a legal move. After that, the game is considered lost.
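As an illustration of this retry mechanism, a loop along the following lines would do the job. It assumes python-chess for move validation and a placeholder `ask_model` callable for the LLM call; it is a sketch of the behavior described above, not the project's actual implementation.

```python
import chess

MAX_ATTEMPTS = 3  # one initial try plus two additional attempts


def get_legal_move(board: chess.Board, ask_model, base_prompt: str) -> chess.Move | None:
    """Ask the model for a move, feeding the failure reason back on each error."""
    prompt = base_prompt
    for _ in range(MAX_ATTEMPTS):
        reply = ask_model(prompt).strip()  # placeholder for the actual LLM call
        try:
            return board.parse_san(reply)  # legal and correctly formatted
        except ValueError as err:
            # Clearly explain why the move was rejected and try again
            prompt = (
                base_prompt
                + f"\n\nYour previous answer '{reply}' was rejected: {err}. "
                "Please answer with a single legal move in SAN notation."
            )
    return None  # three failed attempts: the game is considered lost
```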
How do ratings work?
We're using a standard Elo rating system with just one twist: the K-factor (how much ratings change after each game) decreases as models play more games. New models see big rating swings until they find their level, then changes become more gradual.
K-factor schedule (see the sketch after this list):
• First 2 games: K = 128 (big swings)
• Games 3-5: K = 64
• Games 6-10: K = 32
• Beyond game 10: K = 16 (stable)
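As an illustration, the standard Elo update combined with the schedule above can be written as the following sketch (not necessarily the arena's exact code):

```python
def k_factor(game_number: int) -> int:
    """K for a model's Nth rated game, following the schedule above."""
    if game_number <= 2:
        return 128
    if game_number <= 5:
        return 64
    if game_number <= 10:
        return 32
    return 16


def expected_score(rating: float, opponent_rating: float) -> float:
    """Standard Elo expected score against the given opponent."""
    return 1 / (1 + 10 ** ((opponent_rating - rating) / 400))


def updated_rating(rating: float, opponent_rating: float, score: float, game_number: int) -> float:
    """score is 1 for a win, 0.5 for a draw, 0 for a loss."""
    return rating + k_factor(game_number) * (score - expected_score(rating, opponent_rating))
```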
Games that end due to connection errors, authentication errors, or a model's failure to respond are not included in the rankings.
The ratings are only meaningful within this arena. Please don't compare them to human chess ratings!
How much does it cost?
Each game costs money. This is generally quite low for open-source models (a few cents), but can become very expensive with large proprietary models (around ten euros per game).
That's why I had to limit which models can be used for free. Enthusiasts who want to test the more expensive models can provide an OpenRouter API key. Of course, keys are never saved. The code is open-source: github.com/louisguichard/llm-chess-arena.
Contributions & inspirations
The project is open-source (MIT). Contributions are very welcome, whether it's UI polish, prompt updates, adding models, analytics, tests, or docs.
Repository: github.com/louisguichard/llm-chess-arena
This project was heavily inspired by LMArena and Kaggle Game Arena!
Get in Touch
Email: hello@louisguichard.fr