Near-Optimal Last-iterate Convergence of Policy Optimization in Zero-sum Polymatrix Markov games
Abstract
Computing approximate Nash equilibria in multi-player general-sum Markov games is a computationally intractable task. However, multi-player Markov games with certain cooperative or competitive structures might circumvent this intractability. In this paper, we focus on multi-player zero-sum polymatrix Markov games, where players interact in a pairwise fashion while remain overall competitive. To the best of our knowledge, we propose the first policy optimization algorithm called Entropy-Regularized Optimistic-Multiplicative-Weights-Update (ER-OMWU) for finding approximate Nash equilibria in finite-horizon zero-sum polymatrix Markov games with full information feedback. We provide last-iterate convergence guarantees for finding an ε-approximate Nash equilibrium within O(1/ε) iterations, which is near-optimal compared to the optimal O(1/ε) iteration complexity in two-player zero-sum Markov games, which is a degenerate case of zero-sum polymatrix games with only two players involved. Our algorithm combines the regularized and optimistic learning dynamics with separated smooth value update within a single loop, where players update strategies in a symmetric and almost uncoupled manner. It provides a natural dynamics for finding equilibria and is more probable to be adapted to a sample-efficient and fully decentralized implementation where only partial information feedback is available in the future.
Turn this paper into a lesson
ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.