Unsaturable LLM Benchmark

Experimental Protocol: We pit LLMs against one another in well-known zero-sum games characterized by simple rules yet massive state spaces. Agents receive only sequential delta updates, never full board states or legal-action vectors. Because models must reconstruct the global game state autoregressively, each turn they must both select their next move and introspectively assess the likelihood that their chosen action is permitted by the rules.
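
To make the setup concrete, here is a minimal Python sketch of a single turn under this protocol, assuming a chess-like game where each delta update is just the opponent's last move. The schema and names (DeltaUpdate, AgentReply, query_model) are illustrative only, not the benchmark's actual interface.

```python
# Hypothetical sketch of a single benchmark turn, assuming a chess-like game where
# each delta update is just the opponent's last move in algebraic notation. The
# names (DeltaUpdate, AgentReply, query_model) are illustrative, not the actual API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class DeltaUpdate:
    move_number: int
    opponent_move: str   # only the opponent's latest action, never the full board

@dataclass
class AgentReply:
    move: str            # the agent's chosen action
    p_legal: float       # self-assessed probability that the action is legal

def play_turn(history: list[DeltaUpdate],
              query_model: Callable[[str], tuple[str, float]]) -> AgentReply:
    """The agent sees only the sequence of deltas and must reconstruct the board
    state internally before committing to a move and a legality confidence."""
    prompt = "\n".join(f"{u.move_number}. opponent played {u.opponent_move}"
                       for u in history)
    move, p_legal = query_model(prompt)
    return AgentReply(move=move, p_legal=p_legal)
```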

🛡️ Constraints & Syntax Reliability

A model's reasoning engine is meaningless if it cannot reliably maintain constraints. This metric isolates matchups that ended prematurely, penalizing models that suffer syntax failures or attempt illegal actions and rewarding those that act reliably.
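
As a rough illustration, a completion-rate statistic along these lines could back this metric; the record fields and exact scoring below are assumptions, not the leaderboard's actual formula.

```python
# Hypothetical reliability statistic. Assumes each game record lists the two
# players and, if the game ended prematurely, which model caused the termination
# via a syntax failure or illegal action.
def reliability_rate(games: list[dict], model: str) -> float:
    """Fraction of `model`'s games that it did not end prematurely."""
    played = [g for g in games if model in g["players"]]
    faults = [g for g in played if g.get("terminated_by") == model]
    return (1.0 - len(faults) / len(played)) if played else float("nan")
```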

🎯 Pure Strategic Reasoning

To evaluate true skill independent of formatting errors, we isolate games that were completed successfully without any illegal moves or syntax failures. This strictly measures a model's ability to outsmart and defeat opponents when both sides play without error.
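
A sketch of how this filtering could look, under the same assumed game-record format as above (a clean flag marking error-free games and a winner field); the leaderboard's real aggregation may differ.

```python
# Hypothetical head-to-head score over error-free games only. Assumes each record
# has a "clean" flag (no illegal moves or syntax failures on either side), a
# "players" pair, and a "winner" field that is None for draws.
def clean_win_rate(games: list[dict], model: str, opponent: str) -> float:
    """Win rate of `model` vs `opponent`, draws counted as half, clean games only."""
    clean = [g for g in games
             if g["clean"] and set(g["players"]) == {model, opponent}]
    score = sum(1.0 if g["winner"] == model else (0.5 if g["winner"] is None else 0.0)
                for g in clean)
    return score / len(clean) if clean else float("nan")
```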

🧠 Epistemic Calibration

Each turn, agents report a probabilistic confidence that their chosen move is legal. We compare these confidence scores against the ground-truth legality of each move using bootstrapped ROC-AUC in head-to-head matchups, ranking models strictly by their ability to predict when their own internal state representation has become compromised.
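
A minimal sketch of such a comparison, assuming per-move logs of each model's self-reported legality confidence and the ground-truth legality of the move; the resampling scheme and sample sizes are illustrative, not the leaderboard's exact procedure.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_diff(conf_a, legal_a, conf_b, legal_b, n_boot=2000, seed=0):
    """Bootstrap distribution of AUC(A) - AUC(B) for one head-to-head matchup.
    conf_*: self-reported probability that the chosen move is legal (per move).
    legal_*: 1 if the move was actually legal, 0 otherwise.
    Higher AUC means the model better senses when its own move is illegal."""
    conf_a, legal_a = np.asarray(conf_a), np.asarray(legal_a)
    conf_b, legal_b = np.asarray(conf_b), np.asarray(legal_b)
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_boot):
        ia = rng.integers(0, len(conf_a), len(conf_a))
        ib = rng.integers(0, len(conf_b), len(conf_b))
        # AUC is undefined if a resample contains only one class; skip it.
        if len(np.unique(legal_a[ia])) < 2 or len(np.unique(legal_b[ib])) < 2:
            continue
        diffs.append(roc_auc_score(legal_a[ia], conf_a[ia])
                     - roc_auc_score(legal_b[ib], conf_b[ib]))
    return np.array(diffs)
```

The fraction of resampled differences above zero then gives a direct pairwise read on which model is better calibrated about its own state tracking.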

Rating Leaderboard

Help Improve the Rankings

More matches are needed to tighten the confidence intervals around each rating, and additional funding would allow more models to be added to the leaderboard. Any help covering the API costs is much appreciated.

Support on Ko-fi ❤️