Overview
This project evaluates the logical reasoning capabilities of Large Language Models (LLMs) through strategic gameplay. Unlike standard benchmarks, we test models on their internal world models: they receive move logs but are never provided with current board states (FEN/SGF) or lists of legal moves. They must track the game state entirely by themselves.
Core Metrics
- Rating: A weighted Bradley-Terry model, fitted by global maximum-likelihood optimization, that estimates the probability of one model defeating another from the outcomes of their matchups. Losses caused by illegal moves and syntax errors count against this rating. Ratings are anchored to the openai/gpt-oss-120b baseline at 0.0; if the active game type has no logged games for the baseline model, the ratings for that slice remain unanchored.
- Rating (Metacog): Measures epistemic calibration by comparing the players' Area Under the Receiver Operating Characteristic Curve (ROC-AUC) head-to-head. In each matchup, the two models' legality predictions across all of their games against each other are pooled and bootstrapped to simulate pairwise win/loss outcomes, which are then fed into the same Bradley-Terry model.
- Adherence: Measures the model's ability to follow syntax rules and provide the required <move> and <legal> tags without producing formatting errors or unexpected strings.
- Stability: Quantifies a model's consistency across game types: the ratio of its lowest single-domain rating to its highest (min/max). A stability of 100% means it plays all games equally well; lower scores indicate specialized strengths (e.g., great at Chess, bad at Go).
- Hallucinations: Specifically measures the illegal move percentage, calculated after excluding turns with syntax failures.
- ROC-AUC: Measures how well a model predicts when its internal state space is unreliable, using its predicted probability of moving legally versus the actual legality.
- RBSS (Resolution Brier Skill Score): A skill metric that divides a model's Resolution by its Uncertainty (both from the decomposition of the Brier score), scoring strictly how well its probabilistic confidence separates legal moves from illegal moves.
- Turns to Failure: Measures the average number of turns a model lasts before it makes an illegal move or produces an invalid response. Only includes games that ended in the model's own failure.
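The Rating metric above rests on a Bradley-Terry fit. As a minimal sketch of the idea (unweighted, and using the classic minorization-maximization update rather than the project's actual optimizer, which is an internal detail):

```python
import math

def bradley_terry(wins, players, anchor, iters=500):
    """Fit Bradley-Terry strengths with the classic MM update and
    return log-strength ratings shifted so `anchor` sits at 0.0.

    wins[(a, b)] = number of games in which player a beat player b.
    """
    p = {pl: 1.0 for pl in players}
    for _ in range(iters):
        new = {}
        for a in players:
            # total wins for a, and the MM denominator over all opponents
            w_a = sum(wins.get((a, b), 0) for b in players if b != a)
            denom = sum(
                (wins.get((a, b), 0) + wins.get((b, a), 0)) / (p[a] + p[b])
                for b in players if b != a
            )
            new[a] = w_a / denom if denom > 0 else p[a]
        total = sum(new.values())  # renormalize to pin down the scale
        p = {a: v * len(players) / total for a, v in new.items()}
    return {a: math.log(p[a] / p[anchor]) for a in players}
```

On this log-strength scale, a rating gap maps to a win probability via P(A beats B) = 1 / (1 + exp(-(r_A - r_B))).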
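The RBSS definition follows the Murphy decomposition of the Brier score (Brier = Reliability - Resolution + Uncertainty). A minimal sketch, assuming a simple fixed-bin estimate of Resolution (the bin count here is an illustrative choice, not the project's):

```python
from collections import defaultdict

def rbss(probs, outcomes, n_bins=10):
    """Resolution / Uncertainty over a model's legality predictions.

    probs: predicted P(move is legal) in [0, 1] per turn.
    outcomes: 1 if the move was actually legal, else 0.
    """
    n = len(probs)
    base = sum(outcomes) / n                 # overall legal-move rate
    uncertainty = base * (1 - base)
    bins = defaultdict(list)
    for p, o in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append(o)
    # Resolution: how far each confidence bin's hit rate sits from the base rate
    resolution = sum(
        len(v) * (sum(v) / len(v) - base) ** 2 for v in bins.values()
    ) / n
    return resolution / uncertainty if uncertainty > 0 else 0.0
```

A score of 1.0 means confidence perfectly separates legal from illegal moves; 0.0 means it carries no information beyond the base rate.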
Matchmaking
Opponent pairings are determined dynamically by maximizing a combination of Information Value (IV) and a Directional Penalty:
- Information Value (IV): Calculates the expected informativeness of a match. It is higher for matches where the outcome is uncertain, prioritizing matches between models with similar ratings or high rating uncertainty (standard deviation). Formula:
(stddev_A + stddev_B) - abs(rating_A - rating_B)
- Directional Penalty: Rather than blindly boosting unexplored pairings, this system algorithmically curates a balanced outcome dataset. For a candidate pair, we count how many games each model has already played against opponents in the same direction (higher- or lower-rated) as the candidate. The match's priority score is then penalized proportionally to the base-2 logarithm of that count, scaled by the model's rating uncertainty.
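The two quantities combine into a single priority score per candidate pairing. A minimal sketch; the exact scaling and combination are the project's tuned internals, so the weighting below is an illustrative assumption:

```python
import math

def information_value(rating_a, std_a, rating_b, std_b):
    # Higher when ratings are close or rating uncertainty is large.
    return (std_a + std_b) - abs(rating_a - rating_b)

def directional_penalty(std, games_same_direction):
    # log2-damped penalty for a direction the model has already explored,
    # scaled by that model's rating uncertainty (illustrative weighting).
    return std * math.log2(1 + games_same_direction)

def match_priority(rating_a, std_a, played_a, rating_b, std_b, played_b):
    """played_a / played_b: games each model has already played against
    opponents in the candidate opponent's direction (higher or lower rated)."""
    iv = information_value(rating_a, std_a, rating_b, std_b)
    pen = directional_penalty(std_a, played_a) + directional_penalty(std_b, played_b)
    return iv - pen
```

Pairs with untouched directions keep their full Information Value, while heavily explored directions sink down the queue.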
Bot Prompts & Internal State
To measure true reasoning rather than pattern matching, models are not given a list of legal moves or current game states. Instead, they are prompted to keep track of the game manually and provide their own confidence in move legality (the <legal> tag). This intentionally forces the model to rely on its internal world-model. If a model hallucinates the game state, it will attempt illegal moves or be overly confident in invalid actions.
Here is what bots are told for each game type at the start of the game:
Chess
1. Remember to keep track of the game state, as you will only be provided your opponent's moves, never an updated game state.
2. Your final chosen move MUST be enclosed in <move> tags, like <move>Nf3</move> or <move>O-O</move>. You can also <move>resign</move>.
3. A percentage estimate (0-100) of how likely your move is a legal move MUST be enclosed in <legal> tags, like <legal>57</legal>.
Go
1. Remember to keep track of the game state, as you will only be provided your opponent's moves, never an updated game state. Columns are A-J (excluding I), Rows are 1-9.
2. Your final chosen move MUST be enclosed in <move> tags, like <move>D4</move>, <move>C3</move>, or <move>pass</move>. You can also <move>resign</move>.
3. A percentage estimate (0-100) of how likely your move is a legal move MUST be enclosed in <legal> tags, like <legal>57</legal>.
Texas Hold'em
1. Remember to keep track of the game state (including stacks across hands), as you will only be provided actions and cards, never an updated game state. Minimum bet size is the big blind. Minimum raise is the previous bet/raise amount. All-in is always a legal amount.
2. Your final chosen move MUST be enclosed in <move> tags. This MUST be a single integer representing exactly how many chips you are ADDING TO THE POT with this action. For example:
- If you bet 20 and your opponent raises to 50: to re-raise to 100, use <move>80</move>; to just call, use <move>30</move>.
- If you want to check or fold, use <move>0</move>.
3. A percentage estimate (0-100) of how likely your move is a legal move MUST be enclosed in <legal> tags, like <legal>57</legal>.
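The chips-added convention above reduces to simple arithmetic on this betting round's contributions. A tiny sketch (the function name is ours, not part of the bot protocol):

```python
def chips_to_add(already_in_this_round, target_total):
    """Value the <move> tag must contain to bring this round's
    contribution up to `target_total` (a call or raise-to amount)."""
    return target_total - already_in_this_round
```

For the example above: having bet 20 against a raise to 50, re-raising to 100 needs 100 - 20 = 80 more chips, and calling needs 50 - 20 = 30.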
Model Access & Parameters
All models are accessed via the OpenRouter API to provide a standardized evaluation environment. To ensure fair and accurate metrics, the following parameters are enforced:
- Prompt Caching: Ephemeral prompt caching (cache_control: ephemeral) is applied to the latest system instructions to significantly reduce cost and latency during long games, where the move history grows continuously.
- Reasoning Tokens: Models that support internal Chain-of-Thought (e.g., DeepSeek R1) are explicitly requested to include their reasoning traces (include_reasoning: true).
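Put together, a request under these settings looks roughly like the sketch below. The field shapes follow OpenRouter's chat-completions conventions for caching and reasoning, but the exact payload the project sends is an assumption:

```python
# Illustrative OpenRouter payload; model name and message text are placeholders.
payload = {
    "model": "deepseek/deepseek-r1",
    "include_reasoning": True,  # request the raw reasoning trace
    "messages": [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "...game rules and move-format instructions...",
                    # cache the large, stable system prompt between turns
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "Opponent played: Nf3. Your move."},
    ],
}
```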
Data Quality & Caveats
As the project has evolved, some historical matchups were played under slightly different LLM prompt formulations. These inconsistencies are a known artefact of the project's development. Outdated matchups will be discarded once sufficient match volume has been accumulated to maintain statistically reliable metrics without them.
Funding & Future Plans
I am an independent developer building this project in my free time. Running these large-scale LLMs thousands of times to achieve statistically significant ratings and confidence bounds is very expensive.
If you find this benchmark valuable, please consider supporting the project on Ko-fi. Donations go directly toward paying API costs to keep the current leaderboard up to date and to add new models to the fray.
Regardless of funding, I am committed to continuing the development of this benchmark by introducing new zero-sum environments and refining the metrics to provide the most rigorous and unsaturable evaluation of LLM reasoning possible.