Equilibrium Residuals Expose Three Regimes of Matrix-Game Strategic Reasoning in Language Models
Abstract
Large language models can score well on named game-theory benchmarks yet fail at the same strategic computation once semantic cues are removed. We demonstrate this gap with procedurally generated zero-sum matrix games: a model that recognizes familiar games drops to 34%, 18%, and 2% success on anonymous 2×2, 3×3, and 5×5 payoff matrices, respectively. The benchmark separates three regimes: semantic recall, learned approximate Nash computation, and an output-interface bottleneck that limits scale. With training restricted to 2×2 and 3×3 games, supervised fine-tuning raises success on unseen 5×5 to 7×7 games from 2% to 61%, while exploitability-reward training averages 37% with high seed variance. We prove that the exploitability residual is 2-Lipschitz in payoff perturbations, unlike LP equilibrium selectors that return vertices and are therefore discontinuous. This explains why residual training can transfer under payoff shifts even when formatting instability limits mean performance. A dominated-action padding experiment provides causal evidence: trained models solve 3×3 games embedded in much larger matrices, while random-padded controls fail and dense 12×12 games remain near failure. Procedural evaluation is therefore necessary for measuring strategic reasoning, and residual rewards expose a real but format-limited route to approximate equilibrium computation.
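The exploitability residual and its 2-Lipschitz property can be illustrated with a minimal sketch. The abstract does not give the paper's exact definition or the norm used, so the code below assumes the standard zero-sum form, in which the row player maximizes x^T A y and the residual is max_i (Ay)_i - min_j (x^T A)_j, and it measures payoff perturbations in the entrywise max norm. The names `exploitability` and `lipschitz_bound_holds` are hypothetical and not taken from the paper.

```python
import numpy as np

def exploitability(A, x, y):
    """Nash residual of the mixed-strategy pair (x, y) in a zero-sum game.

    Assumption: the row player maximizes x @ A @ y. The residual is zero
    exactly at a Nash equilibrium and positive otherwise.
    """
    A = np.asarray(A, dtype=float)
    best_row_value = np.max(A @ y)   # row player's best pure reply to y
    best_col_value = np.min(x @ A)   # column player's best pure reply to x
    return float(best_row_value - best_col_value)

def lipschitz_bound_holds(n=5, eps=0.1, trials=1000, seed=0):
    """Empirically check |r(A + E) - r(A)| <= 2 * max|E_ij| on random games."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        A = rng.normal(size=(n, n))
        x = rng.dirichlet(np.ones(n))            # random mixed strategies
        y = rng.dirichlet(np.ones(n))
        E = rng.uniform(-eps, eps, size=(n, n))  # bounded payoff perturbation
        gap = abs(exploitability(A + E, x, y) - exploitability(A, x, y))
        if gap > 2 * np.max(np.abs(E)) + 1e-12:  # small float tolerance
            return False
    return True

# Matching pennies: uniform play is the equilibrium, a pure row strategy is not.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
uniform = np.array([0.5, 0.5])
pure = np.array([1.0, 0.0])

print(exploitability(A, uniform, uniform))  # 0.0
print(exploitability(A, pure, uniform))     # 1.0
print(lipschitz_bound_holds())              # True
```

Under these assumptions the bound follows directly: because x and y are probability vectors, perturbing every payoff by at most ε moves each of the two best-response values by at most ε, so the residual moves by at most 2ε. An LP solver that returns a vertex equilibrium has no such guarantee, since an arbitrarily small payoff change can switch which vertex is optimal.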