Fairness in two-player zero-sum games with bandit feedback

Abstract

We study two-player zero-sum games (TPZSGs) with bandit feedback under fairness constraints that require every action to be played with probability at least $\alpha/m$. Existing instance-dependent results target pure Nash equilibria, whereas fairness generically produces mixed equilibria, which are harder to learn. Our key technical tool is a reparametrization: every fair strategy decomposes as $p = \frac{\alpha}{m}\mathbf{1} + (1-\alpha)\tilde p$ with $\tilde p \in \Delta_m$, and substituting this into the payoff gives $p^\top A q = \tilde p^\top \tilde A q$ for the fair payoff matrix $\tilde A := (1-\alpha)A + \alpha \mathbf{1} c^\top$, where $c_j = \frac{1}{m}\sum_i A(i,j)$ is the column-mean vector. The fair game on $A$ is therefore equivalent to a standard zero-sum game on $\tilde A$, so equilibrium existence, KKT structure, and LP basis stability reduce to classical results applied to $\tilde A$. We derive the fair minimax value, the fair Nash equilibrium, and fair regret, together with a dual representation showing that the price of fairness is at most $\alpha(1-1/m)$ and vanishes whenever the unconstrained equilibrium already has full support. Our main result is an $O(T^{2/3})$ regret bound for an Explore-Then-Commit algorithm, Fair-ETC-TPZSG, which applies to general mixed fair equilibria; we also discuss why naive action elimination does not readily improve it. When the fair equilibrium has a single dominant action, equivalently when the equilibrium $\tilde p$ is a vertex of $\Delta_m$, the bound sharpens to an instance-dependent $O(1/\Delta(\alpha)^2)$, where $\Delta(\alpha)$ is the LP-margin gap.
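The reparametrization can be checked numerically. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation or Fair-ETC-TPZSG itself. It assumes a hypothetical payoff matrix with entries in $[0,1]$, a row player who maximizes $p^\top A q$ and carries the fairness constraint (as in the identity for $p$), and helper names (`fair_matrix`, `to_fair`, `hedge_selfplay`) that are our own. It builds $\tilde A$, confirms $p^\top A q = \tilde p^\top \tilde A q$ exactly with rational arithmetic, and then solves the fair game as an ordinary zero-sum game on $\tilde A$. Where the abstract appeals to LP structure, the sketch swaps in Hedge (multiplicative-weights) self-play, a standard approximate solver, so that it runs with the Python standard library alone.

```python
import math
from fractions import Fraction


def fair_matrix(A, alpha):
    """A_tilde = (1 - alpha) * A + alpha * 1 c^T, with c_j the mean of column j."""
    m, n = len(A), len(A[0])
    c = [sum(A[i][j] for i in range(m)) / m for j in range(n)]
    return [[(1 - alpha) * A[i][j] + alpha * c[j] for j in range(n)]
            for i in range(m)]


def to_fair(p_tilde, alpha):
    """Map p_tilde in the simplex to the fair strategy (alpha/m) 1 + (1 - alpha) p_tilde."""
    m = len(p_tilde)
    return [alpha / m + (1 - alpha) * x for x in p_tilde]


def bilinear(p, A, q):
    """Expected payoff p^T A q to the (maximizing) row player."""
    return sum(p[i] * A[i][j] * q[j] for i in range(len(p)) for j in range(len(q)))


def softmax(scores):
    top = max(scores)  # shift for numerical stability
    w = [math.exp(s - top) for s in scores]
    total = sum(w)
    return [x / total for x in w]


def hedge_selfplay(M, T=5000):
    """Approximate zero-sum solver (swapped in for an LP): both players run Hedge
    on M (entries in [0, 1], row maximizes) and their plays are averaged.
    Returns (p_bar, q_bar, lower, upper) with lower <= value(M) <= upper."""
    m, n = len(M), len(M[0])
    eta_r, eta_c = math.sqrt(8 * math.log(m) / T), math.sqrt(8 * math.log(n) / T)
    gains, losses = [0.0] * m, [0.0] * n
    p_bar, q_bar = [0.0] * m, [0.0] * n
    for _ in range(T):
        p = softmax([eta_r * g for g in gains])    # row: favor high cumulative payoff
        q = softmax([-eta_c * l for l in losses])  # column: favor low cumulative payoff
        for i in range(m):
            gains[i] += sum(M[i][j] * q[j] for j in range(n))
        for j in range(n):
            losses[j] += sum(p[i] * M[i][j] for i in range(m))
        p_bar = [a + b / T for a, b in zip(p_bar, p)]
        q_bar = [a + b / T for a, b in zip(q_bar, q)]
    lower = min(sum(p_bar[i] * M[i][j] for i in range(m)) for j in range(n))
    upper = max(sum(M[i][j] * q_bar[j] for j in range(n)) for i in range(m))
    return p_bar, q_bar, lower, upper


def as_float(M):
    return [[float(x) for x in row] for row in M]


# Hypothetical 2x2 example (assumption): payoffs in [0, 1], row player maximizes.
alpha = Fraction(1, 2)
A = [[Fraction(8, 10), Fraction(6, 10)],
     [Fraction(3, 10), Fraction(2, 10)]]
A_tilde = fair_matrix(A, alpha)

# Reparametrization identity, checked exactly: p^T A q == p_tilde^T A_tilde q.
p_tilde, q = [Fraction(1, 3), Fraction(2, 3)], [Fraction(1, 4), Fraction(3, 4)]
assert bilinear(to_fair(p_tilde, alpha), A, q) == bilinear(p_tilde, A_tilde, q)

# Fair game = ordinary zero-sum game on A_tilde; compare with the game on A.
p_bar, _, fair_lo, fair_hi = hedge_selfplay(as_float(A_tilde))
_, _, val_lo, val_hi = hedge_selfplay(as_float(A))
p_fair = to_fair(p_bar, float(alpha))
m = len(A)
print(f"fair value in [{fair_lo:.3f}, {fair_hi:.3f}], fair p = {[round(x, 3) for x in p_fair]}")
print(f"unconstrained value in [{val_lo:.3f}, {val_hi:.3f}]")
print(f"price-of-fairness bound alpha*(1 - 1/m) = {float(alpha) * (1 - 1 / m):.3f}")
```

For this matrix the unconstrained game has a saddle point at (row 1, column 2) with value 0.6. With $\alpha = 1/2$ the exact fair value is 0.5, attained at $p = (0.75, 0.25)$, so the fair $\tilde p$ is the vertex $e_1$; the resulting price of fairness, 0.1, stays below $\alpha(1-1/m) = 0.25$. Hedge only brackets these values approximately, so the printed numbers should be close to them rather than equal.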

