Leaderboard
Top models ranked by Elo rating in this arena.
| # | Player | Elo | Games | W / D / L | Time/Move | Cost/Game |
|---|---|---|---|---|---|---|
| 1 | Gemini 3 Pro (Expensive) | 1469 | 6 | 67% / 17% / 17% | 92s | €4.6 |
| 2 | Grok 4.1 Fast x-ai (Expensive) | 1440 | 6 | 67% / 17% / 17% | 112s | €0.0 |
| 3 | GPT-5.1 openai (Expensive) | 1429 | 3 | 67% / 0% / 33% | 344s | €8.1 |
| 4 | GPT-5 openai (Expensive) | 1418 | 10 | 70% / 10% / 20% | 252s | €8.5 |
| 5 | Grok 4 x-ai (Expensive) | 1413 | 13 | 62% / 15% / 23% | 343s | - |
| 6 | GPT-5 Mini openai | 1410 | 68 | 78% / 7% / 15% | 73s | €0.5 |
| 7 | GPT-OSS 120B openai (Open-source, Fast) | 1348 | 103 | 74% / 5% / 21% | 19s | €0.2 |
| 8 | GPT-5 Nano openai | 1345 | 42 | 62% / 14% / 24% | 68s | €0.2 |
| 9 | o3 openai (Expensive) | 1332 | 6 | 50% / 17% / 33% | 110s | €3.4 |
| 10 | Claude Opus 4.1 anthropic (Expensive) | 1232 | 5 | 60% / 20% / 20% | 25s | €4.0 |
| 11 | Claude Sonnet 4.5 anthropic (Expensive) | 1152 | 2 | 50% / 0% / 50% | 31s | €1.2 |
| 12 | GPT-OSS 20B openai (Open-source) | 1145 | 42 | 62% / 7% / 31% | 45s | €0.2 |
| 13 | MiniMax M2 minimax (Open-source) | 1120 | 13 | 46% / 0% / 54% | 262s | €0.4 |
| 14 | Gemini 2.5 Pro (Expensive) | 1087 | 9 | 33% / 0% / 67% | 47s | €2.6 |
| 15 | Claude Sonnet 4 anthropic (Expensive, Fast) | 1050 | 10 | 60% / 0% / 40% | 21s | €1.3 |
| 16 | Mistral Medium 3.1 mistralai (Fast) | 1049 | 35 | 46% / 0% / 54% | 21s | €0.1 |
| 17 | GPT-4o openai (Fast) | 1047 | 67 | 40% / 0% / 60% | 10s | €0.4 |
| 18 | DeepSeek V3 deepseek (Open-source) | 1036 | 17 | 47% / 0% / 53% | 122s | €0.0 |
| 19 | Llama 4 Maverick meta-llama (Open-source, Fast) | 1036 | 43 | 44% / 0% / 56% | 7s | €0.1 |
| 20 | Grok 3 x-ai (Expensive) | 1035 | 8 | 38% / 0% / 62% | 26s | €1.4 |
| 21 | DeepSeek V3.1 deepseek (Open-source) | 1034 | 18 | 39% / 0% / 61% | 26s | €0.1 |
| 22 | DeepSeek R1 deepseek (Open-source) | 986 | 31 | 32% / 0% / 68% | 88s | €0.1 |
| 23 | Gemini 2.5 Flash (Fast) | 938 | 74 | 30% / 0% / 70% | 11s | €0.3 |
| 24 | Qwen3 30B qwen (Open-source) | 932 | 18 | 22% / 0% / 78% | 141s | €0.1 |
| 25 | Kimi K2 moonshotai (Open-source) | 924 | 43 | 28% / 0% / 72% | 25s | €0.0 |
| 26 | Gemma 3 27B (Open-source) | 895 | 33 | 21% / 0% / 79% | 26s | €0.0 |
| 27 | Gemini 2.5 Flash Lite (Fast) | 877 | 34 | 18% / 0% / 82% | 10s | €0.1 |
Note: These ratings are specific to this leaderboard and are not comparable to FIDE, Lichess, or Chess.com ratings. They only indicate relative performance between models here.
About
Here are some explanations about LLM Chess Arena!
What is this about?
LLM Chess Arena is a place where generative AI models compete against each other at chess. The goal is to build an LLM leaderboard that mainly reflects the reasoning abilities of these models.
Why chess?
The truth is simply that I love both LLMs and chess! But it turns out that having LLMs play chess is not entirely uninteresting: after a few moves, each position becomes almost unique, and reasoning matters at least as much as memorization. The best models should therefore be those capable of the deepest thinking.
Please note that LLMs are not good at chess! This is normal: they were not built for that purpose (unlike Stockfish, for example). An LLM's sole job is to string words together in the most plausible way; it is not a chess engine.
What information do the models have access to?
My approach is to give the models all the necessary information and encourage them to think, but not to do the work for them just to make them look better than they are.
Each model receives a text description of the current position that includes:
- An ASCII representation of the board
- Lists of where each piece is located
- Recent move history to understand the game flow
I deliberately don't provide the list of legal moves. This might seem harsh, but here's my point: if a model isn't even capable of telling whether a move is legal or not, what's the point of having it play?
The system prompt encourages structured thinking: analyze the position, consider candidate moves, check for tactics, then choose.
If you're interested, please have a look at the model prompts here: prompts.py!
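For illustration, here is a minimal sketch of how such a position description could be assembled with the python-chess library. The function name, its arguments, and the exact wording are assumptions for this example; the prompts actually used in the arena are the ones defined in prompts.py.

```python
import chess


def describe_position(board: chess.Board, san_history: list[str]) -> str:
    """Build a plain-text position description: ASCII board, piece
    locations, and recent move history (hypothetical helper, not the
    arena's actual implementation)."""
    lines = ["Current board:", str(board), ""]  # str(board) is an ASCII diagram

    # Where each side's pieces stand, e.g. "White pieces: K e1, Q d1, ..."
    for color, name in ((chess.WHITE, "White"), (chess.BLACK, "Black")):
        placed = sorted(
            f"{piece.symbol().upper()} {chess.square_name(square)}"
            for square, piece in board.piece_map().items()
            if piece.color == color
        )
        lines.append(f"{name} pieces: " + ", ".join(placed))

    # Recent moves so the model understands the game flow.
    recent = " ".join(san_history[-10:]) if san_history else "none"
    lines.append("")
    lines.append(f"Last moves: {recent}")
    lines.append(f"{'White' if board.turn == chess.WHITE else 'Black'} to move.")
    return "\n".join(lines)
```

Note that nothing in this description lists the legal moves; the model has to work those out on its own.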
What happens when a model makes a mistake?
Models make mistakes... often! They may suggest illegal moves, invent pieces they don't have, move their opponent's pieces, or simply return an incorrectly formatted response.
When this happens, the reason for the failure is clearly explained to them, and they are given two additional attempts to provide a legal move. After that, the game is considered lost.
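As a rough illustration, this retry logic can be expressed in a few lines with python-chess. Here `ask_model` is a hypothetical stand-in for the real API call, and the feedback wording is mine, not the arena's:

```python
import chess

MAX_ATTEMPTS = 3  # one initial try plus two additional attempts


def play_one_move(board: chess.Board, ask_model) -> chess.Move | None:
    """Ask the model for a move in SAN, retrying with feedback on failure.
    Returns None when every attempt fails, i.e. the game is lost."""
    feedback = ""
    for _ in range(MAX_ATTEMPTS):
        reply = ask_model(feedback)  # hypothetical call to the LLM
        try:
            move = board.parse_san(reply)  # raises if malformed or illegal
        except ValueError as error:
            # Explain exactly why the move was rejected, then retry.
            feedback = f"Your move '{reply}' was rejected: {error}. Try again."
            continue
        board.push(move)
        return move
    return None  # three failed attempts: the game is counted as lost
```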
How do ratings work?
We're using a standard Elo rating system with just one twist: the K-factor (how much ratings change after each game) decreases as models play more games. New models see big rating swings until they find their level, then changes become more gradual.
K-factor schedule:
• First 2 games: K = 128 (big swings)
• Games 3-5: K = 64
• Games 6-10: K = 32
• Beyond game 10: K = 16 (stable)
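As a minimal sketch of this update rule (the function names are mine, and the repository's implementation may differ):

```python
def k_factor(game_number: int) -> int:
    """K-factor for a model's n-th game (1-indexed), per the schedule above."""
    if game_number <= 2:
        return 128
    if game_number <= 5:
        return 64
    if game_number <= 10:
        return 32
    return 16


def expected_score(rating: float, opponent_rating: float) -> float:
    """Win probability under the standard Elo model."""
    return 1 / (1 + 10 ** ((opponent_rating - rating) / 400))


def new_rating(rating: float, opponent_rating: float, score: float, game_number: int) -> float:
    """Update after one game; score is 1 for a win, 0.5 for a draw, 0 for a loss."""
    return rating + k_factor(game_number) * (score - expected_score(rating, opponent_rating))
```

For example, a brand-new model rated 1000 that beats a 1200-rated opponent in its first game has an expected score of about 0.24, so it gains roughly 128 × 0.76 ≈ 97 points.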
Games that end due to connection errors, authentication errors, or a model's failure to respond are not included in the rankings.
The ratings are only meaningful within this arena. Please don't compare them to human chess ratings!
How much does it cost?
Each game costs money: generally very little for open-source models (a few cents), but the larger proprietary models can get expensive (around ten euros per game).
That's why I had to limit the models you can freely use. Enthusiasts who want to test the more expensive models can provide an OpenRouter API key. Of course, keys are never saved. The code is open-source: github.com/louisguichard/llm-chess-arena.
Contributions & inspirations
The project is open-source (MIT). Contributions are very welcome, whether it's UI polish, prompt updates, adding models, analytics, tests, or docs.
Repository: github.com/louisguichard/llm-chess-arena
This project was heavily inspired by LMArena and Kaggle Game Arena!
Get in Touch
Email: hello@louisguichard.fr