Model-Based Reinforcement Learning for Offline Zero-Sum Markov Games

Abstract

This paper makes progress towards learning Nash equilibria in two-player zero-sum Markov games from offline data. Specifically, consider a $\gamma$-discounted infinite-horizon Markov game with $S$ states, where the max-player has $A$ actions and the min-player has $B$ actions. We propose a pessimistic model-based algorithm with Bernstein-style lower confidence bounds, called VI-LCB-Game, that provably finds an $\varepsilon$-approximate Nash equilibrium with a sample complexity no larger than $\frac{C_{\mathsf{clipped}} S(A+B)}{(1-\gamma)^{3}\varepsilon^{2}}$ (up to some log factor). Here, $C_{\mathsf{clipped}}$ is some unilateral clipped concentrability coefficient that reflects the coverage and distribution shift of the available data (vis-à-vis the target data), and the target accuracy $\varepsilon$ can be any value within $\big(0, \frac{1}{1-\gamma}\big]$. Our sample complexity bound strengthens prior art by a factor of $\min\{A,B\}$, achieving minimax optimality for the entire $\varepsilon$-range. An appealing feature of our result lies in its algorithmic simplicity, which shows that variance reduction and sample splitting are unnecessary for achieving sample optimality.
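
To get a feel for how the stated bound scales, the following minimal sketch evaluates C_clipped · S(A+B) / ((1-γ)^3 ε^2) for given problem sizes. The function name `sample_bound` and its argument names are hypothetical, chosen only for illustration; the abstract gives the bound up to a log factor and an unspecified constant, both of which are dropped here, so the result indicates order of magnitude only and is not a guaranteed sample count.

```python
def sample_bound(S, A, B, gamma, eps, c_clipped):
    """Order-level sample size c_clipped * S * (A + B) / ((1 - gamma)^3 * eps^2).

    Hypothetical helper: constants and logarithmic factors are omitted,
    so treat the return value as a scaling guide only.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    horizon = 1.0 / (1.0 - gamma)  # effective horizon 1 / (1 - gamma)
    # The abstract states the accuracy range as (0, 1 / (1 - gamma)].
    if not 0.0 < eps <= horizon:
        raise ValueError("eps must lie in (0, 1 / (1 - gamma)]")
    return c_clipped * S * (A + B) * horizon**3 / eps**2


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the paper.
    print(sample_bound(S=10, A=3, B=2, gamma=0.5, eps=1.0, c_clipped=2.0))
```

Halving `eps` quadruples the value, while the action counts enter only through the sum `A + B` and not the product `A * B`.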