Efficiently Solving MDPs with Stochastic Mirror Descent

Abstract

We present a unified framework based on primal-dual stochastic mirror descent for approximately solving infinite-horizon Markov decision processes (MDPs) given a generative model. When applied to an average-reward MDP with $A_{\mathrm{tot}}$ total state-action pairs and mixing time bound $t_{\mathrm{mix}}$, our method computes an $\epsilon$-optimal policy with an expected $O(t_{\mathrm{mix}}^{2} A_{\mathrm{tot}} \epsilon^{-2})$ samples from the state-transition matrix, removing the ergodicity dependence of prior art. When applied to a $\gamma$-discounted MDP with $A_{\mathrm{tot}}$ total state-action pairs, our method computes an $\epsilon$-optimal policy with an expected $O((1-\gamma)^{-4} A_{\mathrm{tot}} \epsilon^{-2})$ samples, matching the previous state-of-the-art up to a $(1-\gamma)^{-1}$ factor. Both methods are model-free, update state values and policies simultaneously, and run in time linear in the number of samples taken. We achieve these results through a more general stochastic mirror descent framework for solving bilinear saddle-point problems with simplex and box domains, and we demonstrate the flexibility of this framework by providing further applications to constrained MDPs.
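
The abstract names the technique but does not spell out the updates. As a rough illustration only, the sketch below runs primal-dual stochastic mirror descent on a generic bilinear saddle-point problem $\min_{x \in [-1,1]^n} \max_{y \in \Delta^m} y^\top A x$, with a clipped Euclidean step on the box and an entropic (multiplicative-weights) step on the simplex. The function names, step sizes, sampling scheme, and test matrix are all assumptions made for this example; it is not the authors' algorithm and does not encode an MDP.

```python
import numpy as np


def smd_bilinear(A, steps=2000, eta_x=0.05, eta_y=0.05, seed=0):
    """Sketch: stochastic mirror descent for min_{x in box} max_{y in simplex} y^T A x.

    All parameter names and defaults are illustrative assumptions.
    Returns the averaged iterates (x_avg, y_avg).
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)        # box variable in [-1, 1]^n
    logits = np.zeros(m)   # simplex variable kept in log space for stability
    y = np.full(m, 1.0 / m)
    x_sum = np.zeros(n)
    y_sum = np.zeros(m)
    for _ in range(steps):
        # Unbiased estimate of A^T y: sample a row index i with probability y_i.
        i = rng.choice(m, p=y)
        g_x = A[i, :]
        # Unbiased estimate of A x: sample a column j uniformly and rescale by n.
        j = rng.integers(n)
        g_y = n * A[:, j] * x[j]
        # Box domain: Euclidean mirror map, i.e. gradient step followed by clipping.
        x = np.clip(x - eta_x * g_x, -1.0, 1.0)
        # Simplex domain: entropic mirror map, i.e. multiplicative-weights ascent step.
        logits = logits + eta_y * g_y
        logits -= logits.max()
        y = np.exp(logits)
        y /= y.sum()
        x_sum += x
        y_sum += y
    return x_sum / steps, y_sum / steps


def duality_gap(A, x, y):
    """Gap max_y' y'^T A x - min_x' y^T A x' for the box [-1, 1]^n and the simplex."""
    return np.max(A @ x) + np.abs(A.T @ y).sum()


if __name__ == "__main__":
    A = np.array([[1.0, -0.5], [-0.3, 0.8], [0.2, 0.1]])
    x_avg, y_avg = smd_bilinear(A)
    print("gap of averaged iterates:", duality_gap(A, x_avg, y_avg))
```

The averaged iterates, rather than the last ones, are returned because mirror descent guarantees for saddle-point problems are usually stated for averages. Each iteration touches one row and one column of `A`, which is the sense in which such methods can run in time proportional to the number of samples.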
