Asymptotically optimal regret in communicating Markov decision processes
Abstract
In this paper, we present a learning algorithm that achieves asymptotically optimal regret for average-reward Markov decision processes under a communicating assumption. That is, given a communicating Markov decision process M, our algorithm has regret K(M) log(T) + o(log(T)), where T is the number of learning steps and K(M) is the best possible constant. The algorithm works by explicitly tracking the constant K(M) to learn optimally, then balancing the trade-off between exploration (playing sub-optimally to gain information), co-exploration (playing optimally to gain information) and exploitation (playing optimally to score maximally). We further show that the function K(M) is discontinuous, which is a significant challenge for our approach. To address it, we describe a regularization mechanism to estimate K(M) with arbitrary precision from empirical data.