Leaderboard
Top models ranked by Elo rating in this arena.
| # | Player | Elo | Games | W / D / L | Time/Move | Cost/Game |
|---|---|---|---|---|---|---|
| 1 | Gemini 3 Pro (Expensive) | 1469 | 6 | 67% / 17% / 17% | 92s | €4.6 |
| 2 | Grok 4.1 Fast x-ai (Expensive) | 1440 | 6 | 67% / 17% / 17% | 112s | €0.0 |
| 3 | GPT-5.1 openai (Expensive) | 1429 | 3 | 67% / 0% / 33% | 344s | €8.1 |
| 4 | GPT-5 openai (Expensive) | 1418 | 10 | 70% / 10% / 20% | 252s | €8.5 |
| 5 | Grok 4 x-ai (Expensive) | 1413 | 13 | 62% / 15% / 23% | 343s | - |
| 6 | GPT-5 Mini openai | 1410 | 68 | 78% / 7% / 15% | 73s | €0.5 |
| 7 | GPT-OSS 120B openai (Open-source, Fast) | 1348 | 103 | 74% / 5% / 21% | 19s | €0.2 |
| 8 | GPT-5 Nano openai | 1345 | 42 | 62% / 14% / 24% | 68s | €0.2 |
| 9 | o3 openai (Expensive) | 1332 | 6 | 50% / 17% / 33% | 110s | €3.4 |
| 10 | Claude Opus 4.1 anthropic (Expensive) | 1232 | 5 | 60% / 20% / 20% | 25s | €4.0 |
| 11 | Claude Sonnet 4.5 anthropic (Expensive) | 1152 | 2 | 50% / 0% / 50% | 31s | €1.2 |
| 12 | GPT-OSS 20B openai (Open-source) | 1145 | 42 | 62% / 7% / 31% | 45s | €0.2 |
| 13 | MiniMax M2 minimax (Open-source) | 1120 | 13 | 46% / 0% / 54% | 262s | €0.4 |
| 14 | Gemini 2.5 Pro (Expensive) | 1087 | 9 | 33% / 0% / 67% | 47s | €2.6 |
| 15 | Claude Sonnet 4 anthropic (Expensive, Fast) | 1050 | 10 | 60% / 0% / 40% | 21s | €1.3 |
| 16 | Mistral Medium 3.1 mistralai (Fast) | 1049 | 35 | 46% / 0% / 54% | 21s | €0.1 |
| 17 | GPT-4o openai (Fast) | 1047 | 67 | 40% / 0% / 60% | 10s | €0.4 |
| 18 | DeepSeek V3 deepseek (Open-source) | 1036 | 17 | 47% / 0% / 53% | 122s | €0.0 |
| 19 | Llama 4 Maverick meta-llama (Open-source, Fast) | 1036 | 43 | 44% / 0% / 56% | 7s | €0.1 |
| 20 | Grok 3 x-ai (Expensive) | 1035 | 8 | 38% / 0% / 62% | 26s | €1.4 |
| 21 | DeepSeek V3.1 deepseek (Open-source) | 1034 | 18 | 39% / 0% / 61% | 26s | €0.1 |
| 22 | DeepSeek R1 deepseek (Open-source) | 986 | 31 | 32% / 0% / 68% | 88s | €0.1 |
| 23 | Gemini 2.5 Flash (Fast) | 938 | 74 | 30% / 0% / 70% | 11s | €0.3 |
| 24 | Qwen3 30B qwen (Open-source) | 932 | 18 | 22% / 0% / 78% | 141s | €0.1 |
| 25 | Kimi K2 moonshotai (Open-source) | 924 | 43 | 28% / 0% / 72% | 25s | €0.0 |
| 26 | Gemma 3 27B (Open-source) | 895 | 33 | 21% / 0% / 79% | 26s | €0.0 |
| 27 | Gemini 2.5 Flash Lite (Fast) | 877 | 34 | 18% / 0% / 82% | 10s | €0.1 |
Note: These ratings are specific to this leaderboard and are not comparable to FIDE, Lichess, or Chess.com ratings. They only indicate relative performance between models here.
About
Here are some explanations about LLM Chess Arena!
What is this about?
LLM Chess Arena is a place where generative AI models compete against each other at chess. The goal is to build an LLM leaderboard that mainly reflects the reasoning abilities of these models.
Why chess?
The truth is simply that I love both LLMs and chess! But it turns out that having LLMs play chess is not entirely uninteresting: after a few moves, each position becomes almost unique, and reasoning matters at least as much as memorization. The best models should therefore be those capable of the deepest thinking.
Please note that LLMs are not good at chess! This is normal: they were not built for that purpose (unlike Stockfish, for example). An LLM's sole job is to string words together in the most plausible way; it is not a chess engine.
What information do the models have access to?
My approach is to give the models all the necessary information and encourage them to think, but not to do the work for them just to make them look better than they are.
Each model receives a text description of the current position that includes:
- An ASCII representation of the board
- Lists of where each piece is located
- Recent move history to understand the game flow
I deliberately don't provide the list of legal moves. This might seem harsh, but here's my point: if a model isn't even capable of telling whether a move is legal or not, what's the point of having it play?
The system prompt encourages structured thinking: analyze the position, consider candidate moves, check for tactics, then choose.
If you're interested, please have a look at the model prompts here: prompts.py!
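For illustration, here is a minimal sketch of how such a position description could be assembled with the python-chess library. The function name, its arguments, and the exact wording are assumptions for this example; the prompts actually used in the arena are the ones defined in prompts.py.

```python
import chess


def describe_position(board: chess.Board, san_history: list[str]) -> str:
    """Build a plain-text position description: ASCII board, piece
    locations, and recent move history (hypothetical helper, not the
    arena's actual implementation)."""
    lines = ["Current board:", str(board), ""]  # str(board) is an ASCII diagram

    # Where each side's pieces stand, e.g. "White pieces: K e1, Q d1, ..."
    for color, name in ((chess.WHITE, "White"), (chess.BLACK, "Black")):
        placed = sorted(
            f"{piece.symbol().upper()} {chess.square_name(square)}"
            for square, piece in board.piece_map().items()
            if piece.color == color
        )
        lines.append(f"{name} pieces: " + ", ".join(placed))

    # Recent moves so the model understands the game flow.
    recent = " ".join(san_history[-10:]) if san_history else "none"
    lines.append("")
    lines.append(f"Last moves: {recent}")
    lines.append(f"{'White' if board.turn == chess.WHITE else 'Black'} to move.")
    return "\n".join(lines)
```

Note that nothing in this description lists the legal moves; the model has to work those out on its own.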
What happens when a model makes a mistake?
Models make mistakes... often! They may suggest illegal moves, invent pieces they don't have, move their opponent's pieces, or simply return an incorrectly formatted response.
When this happens, the reason for the failure is clearly explained to them, and they are given two additional attempts to provide a legal move. After that, the game is considered lost.
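As a rough illustration, this retry logic can be expressed in a few lines with python-chess. Here `ask_model` is a hypothetical stand-in for the real API call, and the feedback wording is mine, not the arena's:

```python
import chess

MAX_ATTEMPTS = 3  # one initial try plus two additional attempts


def play_one_move(board: chess.Board, ask_model) -> chess.Move | None:
    """Ask the model for a move in SAN, retrying with feedback on failure.
    Returns None when every attempt fails, i.e. the game is lost."""
    feedback = ""
    for _ in range(MAX_ATTEMPTS):
        reply = ask_model(feedback)  # hypothetical call to the LLM
        try:
            move = board.parse_san(reply)  # raises if malformed or illegal
        except ValueError as error:
            # Explain exactly why the move was rejected, then retry.
            feedback = f"Your move '{reply}' was rejected: {error}. Try again."
            continue
        board.push(move)
        return move
    return None  # three failed attempts: the game is counted as lost
```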
How do ratings work?
We're using a standard Elo rating system with just one twist: the K-factor (how much ratings change after each game) decreases as models play more games. New models see big rating swings until they find their level, then changes become more gradual.
K-factor schedule:
• First 2 games: K = 128 (big swings)
• Games 3-5: K = 64
• Games 6-10: K = 32
• Beyond game 10: K = 16 (stable)
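As a minimal sketch of this update rule (the function names are mine, and the repository's implementation may differ):

```python
def k_factor(game_number: int) -> int:
    """K-factor for a model's n-th game (1-indexed), per the schedule above."""
    if game_number <= 2:
        return 128
    if game_number <= 5:
        return 64
    if game_number <= 10:
        return 32
    return 16


def expected_score(rating: float, opponent_rating: float) -> float:
    """Win probability under the standard Elo model."""
    return 1 / (1 + 10 ** ((opponent_rating - rating) / 400))


def new_rating(rating: float, opponent_rating: float, score: float, game_number: int) -> float:
    """Update after one game; score is 1 for a win, 0.5 for a draw, 0 for a loss."""
    return rating + k_factor(game_number) * (score - expected_score(rating, opponent_rating))
```

For example, a brand-new model rated 1000 that beats a 1200-rated opponent in its first game has an expected score of about 0.24, so it gains roughly 128 × 0.76 ≈ 97 points.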
Games that end due to connection errors, authentication errors, or a model's failure to respond are not included in the rankings.
The ratings are only meaningful within this arena. Please don't compare them to human chess ratings!
How much does it cost?
Each game costs money: generally very little for open-source models (a few cents), but the larger proprietary models can get expensive (around ten euros per game).
That's why I had to limit the models you can freely use. Enthusiasts who want to test the more expensive models can provide an OpenRouter API key. Of course, keys are never saved. The code is open-source: github.com/louisguichard/llm-chess-arena.
Contributions & inspirations
The project is open-source (MIT). Contributions are very welcome, whether it's UI polish, prompt updates, adding models, analytics, tests, or docs.
Repository: github.com/louisguichard/llm-chess-arena
This project was heavily inspired by LMArena and Kaggle Game Arena!
Get in Touch
Email: hello@louisguichard.fr