Overview
This project evaluates the logical reasoning capabilities of Large Language Models (LLMs) through strategic gameplay. Unlike standard benchmarks, we test models on their internal world models: they receive move logs but are never provided with current board states (FEN/SGF) or lists of legal moves. They must track the game state entirely by themselves.
Core Metrics
- Rating: A weighted Bradley-Terry model, fitted by global maximum-likelihood optimization, that estimates the probability of one model defeating another from the outcomes of their matchups. Losses caused by illegal moves and syntax errors count against this rating. Ratings are anchored to the openai/gpt-oss-120b baseline at 0.0; if the active game type has no logged games for the baseline model, the ratings for that slice remain unanchored.
- Rating (Metacog): Measures epistemic calibration by comparing the players' Area Under the Receiver Operating Characteristic Curve (ROC-AUC) head-to-head. In each matchup, the two models' legality predictions across all of their games against each other are pooled and bootstrapped to simulate pairwise win/loss outcomes, which are then fed into the same Bradley-Terry model.
- Adherence: Measures the model's ability to follow syntax rules and provide the required <move> and <legal> tags without producing formatting errors or unexpected strings.
- Stability: Quantifies a model's consistency across game types: the ratio of its lowest single-domain rating to its highest (min/max). A stability of 100% means it plays all games equally well; lower scores indicate specialized strengths (e.g., great at Chess, bad at Go).
- Hallucinations: Specifically measures the illegal move percentage, calculated after excluding turns with syntax failures.
- ROC-AUC: Measures how well a model predicts when its internal state space is unreliable, using its predicted probability of moving legally versus the actual legality.
- RBSS (Resolution Brier Skill Score): A skill metric that divides a model's Resolution by its Uncertainty (both from the decomposition of the Brier score), scoring strictly how well its probabilistic confidence separates legal moves from illegal moves.
- Turns to Failure: Measures the average number of turns a model lasts before it makes an illegal move or produces an invalid response. Only includes games that ended in the model's own failure.
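The Rating metric above rests on a Bradley-Terry fit. As a minimal sketch of the idea (unweighted, and using the classic minorization-maximization update rather than the project's actual optimizer, which is an internal detail):

```python
import math

def bradley_terry(wins, players, anchor, iters=500):
    """Fit Bradley-Terry strengths with the classic MM update and
    return log-strength ratings shifted so `anchor` sits at 0.0.

    wins[(a, b)] = number of games in which player a beat player b.
    """
    p = {pl: 1.0 for pl in players}
    for _ in range(iters):
        new = {}
        for a in players:
            # total wins for a, and the MM denominator over all opponents
            w_a = sum(wins.get((a, b), 0) for b in players if b != a)
            denom = sum(
                (wins.get((a, b), 0) + wins.get((b, a), 0)) / (p[a] + p[b])
                for b in players if b != a
            )
            new[a] = w_a / denom if denom > 0 else p[a]
        total = sum(new.values())  # renormalize to pin down the scale
        p = {a: v * len(players) / total for a, v in new.items()}
    return {a: math.log(p[a] / p[anchor]) for a in players}
```

On this log-strength scale, a rating gap maps to a win probability via P(A beats B) = 1 / (1 + exp(-(r_A - r_B))).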
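The RBSS definition follows the Murphy decomposition of the Brier score (Brier = Reliability - Resolution + Uncertainty). A minimal sketch, assuming a simple fixed-bin estimate of Resolution (the bin count here is an illustrative choice, not the project's):

```python
from collections import defaultdict

def rbss(probs, outcomes, n_bins=10):
    """Resolution / Uncertainty over a model's legality predictions.

    probs: predicted P(move is legal) in [0, 1] per turn.
    outcomes: 1 if the move was actually legal, else 0.
    """
    n = len(probs)
    base = sum(outcomes) / n                 # overall legal-move rate
    uncertainty = base * (1 - base)
    bins = defaultdict(list)
    for p, o in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append(o)
    # Resolution: how far each confidence bin's hit rate sits from the base rate
    resolution = sum(
        len(v) * (sum(v) / len(v) - base) ** 2 for v in bins.values()
    ) / n
    return resolution / uncertainty if uncertainty > 0 else 0.0
```

A score of 1.0 means confidence perfectly separates legal from illegal moves; 0.0 means it carries no information beyond the base rate.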
Matchmaking
Opponent pairings are determined dynamically by maximizing a combination of Information Value (IV) and a Directional Penalty:
- Information Value (IV): Calculates the expected informativeness of a match. It is higher for matches where the outcome is uncertain, prioritizing matches between models with similar ratings or high rating uncertainty (standard deviation). Formula:
(stddev_A + stddev_B) - abs(rating_A - rating_B)
- Directional Penalty: Rather than blindly boosting unexplored pairings, this system algorithmically curates a balanced outcome dataset. For a candidate pair, we count how many games each model has already played against opponents in the same direction (higher- or lower-rated) as the candidate. The match's priority score is then penalized proportionally to the base-2 logarithm of that count, scaled by the model's rating uncertainty.
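The two quantities combine into a single priority score per candidate pairing. A minimal sketch; the exact scaling and combination are the project's tuned internals, so the weighting below is an illustrative assumption:

```python
import math

def information_value(rating_a, std_a, rating_b, std_b):
    # Higher when ratings are close or rating uncertainty is large.
    return (std_a + std_b) - abs(rating_a - rating_b)

def directional_penalty(std, games_same_direction):
    # log2-damped penalty for a direction the model has already explored,
    # scaled by that model's rating uncertainty (illustrative weighting).
    return std * math.log2(1 + games_same_direction)

def match_priority(rating_a, std_a, played_a, rating_b, std_b, played_b):
    """played_a / played_b: games each model has already played against
    opponents in the candidate opponent's direction (higher or lower rated)."""
    iv = information_value(rating_a, std_a, rating_b, std_b)
    pen = directional_penalty(std_a, played_a) + directional_penalty(std_b, played_b)
    return iv - pen
```

Pairs with untouched directions keep their full Information Value, while heavily explored directions sink down the queue.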
Bot Prompts & Internal State
To measure true reasoning rather than pattern matching, models are not given a list of legal moves or current game states. Instead, they are prompted to keep track of the game manually and provide their own confidence in move legality (the <legal> tag). This intentionally forces the model to rely on its internal world-model. If a model hallucinates the game state, it will attempt illegal moves or be overly confident in invalid actions.
Here is what bots are told for each game type at the start of the game:
Chess
1. Remember to keep track of the game state, as you will only be provided your opponent's moves, never an updated game state.
2. Your final chosen move MUST be enclosed in <move> tags, like <move>Nf3</move> or <move>O-O</move>. You can also <move>resign</move>.
3. A percentage estimate (0-100) of how likely your move is a legal move MUST be enclosed in <legal> tags, like <legal>57</legal>.
Go
1. Remember to keep track of the game state, as you will only be provided your opponent's moves, never an updated game state. Columns are A-J (excluding I), Rows are 1-9.
2. Your final chosen move MUST be enclosed in <move> tags, like <move>D4</move>, <move>C3</move>, or <move>pass</move>. You can also <move>resign</move>.
3. A percentage estimate (0-100) of how likely your move is a legal move MUST be enclosed in <legal> tags, like <legal>57</legal>.
Texas Hold'em
1. Remember to keep track of the game state (including stacks across hands), as you will only be provided actions and cards, never an updated game state. Minimum bet size is the big blind. Minimum raise is the previous bet/raise amount. All-in is always a legal amount.
2. Your final chosen move MUST be enclosed in <move> tags. This MUST be a single integer representing exactly how many chips you are ADDING TO THE POT with this action. For example:
- If you bet 20 and your opponent raises to 50: to re-raise to 100, use <move>80</move>; to just call, use <move>30</move>.
- If you want to check or fold, use <move>0</move>.
3. A percentage estimate (0-100) of how likely your move is a legal move MUST be enclosed in <legal> tags, like <legal>57</legal>.
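The chips-added convention above reduces to simple arithmetic on this betting round's contributions. A tiny sketch (the function name is ours, not part of the bot protocol):

```python
def chips_to_add(already_in_this_round, target_total):
    """Value the <move> tag must contain to bring this round's
    contribution up to `target_total` (a call or raise-to amount)."""
    return target_total - already_in_this_round
```

For the example above: having bet 20 against a raise to 50, re-raising to 100 needs 100 - 20 = 80 more chips, and calling needs 50 - 20 = 30.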
Model Access & Parameters
All models are accessed via the OpenRouter API to provide a standardized evaluation environment. To ensure fair and accurate metrics, the following parameters are enforced:
- Prompt Caching: Ephemeral prompt caching (cache_control: ephemeral) is applied to the latest system instructions to significantly reduce cost and latency during long games, where the move history grows continuously.
- Reasoning Tokens: Models that support internal Chain-of-Thought (e.g., DeepSeek R1) are explicitly requested to include their reasoning traces (include_reasoning: true).
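Put together, a request under these settings looks roughly like the sketch below. The field shapes follow OpenRouter's chat-completions conventions for caching and reasoning, but the exact payload the project sends is an assumption:

```python
# Illustrative OpenRouter payload; model name and message text are placeholders.
payload = {
    "model": "deepseek/deepseek-r1",
    "include_reasoning": True,  # request the raw reasoning trace
    "messages": [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "...game rules and move-format instructions...",
                    # cache the large, stable system prompt between turns
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "Opponent played: Nf3. Your move."},
    ],
}
```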
Data Quality & Caveats
As the project has evolved, some historical matchups were played under slightly different LLM prompt formulations. These inconsistencies are a known artefact of the project's development. Outdated matchups will be discarded once sufficient match volume has been accumulated to maintain statistically reliable metrics without them.
Funding & Future Plans
I am an independent developer building this project in my free time. Running these large-scale LLMs thousands of times to achieve statistically significant ratings and confidence bounds is very expensive.
If you find this benchmark valuable, please consider supporting the project on Ko-fi. Donations go directly toward paying API costs to keep the current leaderboard up to date and to add new models to the fray.
Regardless of funding, I am committed to continuing the development of this benchmark by introducing new zero-sum environments and refining the metrics to provide the most rigorous and unsaturable evaluation of LLM reasoning possible.