Near-Optimal Time and Sample Complexities for Solving Discounted Markov Decision Process with a Generative Model
Abstract
In this paper we consider the problem of computing an ε-optimal policy of a discounted Markov Decision Process (DMDP) provided we can only access its transition function through a generative sampling model that, given any state-action pair, samples from the transition function in O(1) time. Given such a DMDP with states S, actions A, discount factor γ ∈ (0,1), and rewards in the range [0, 1], we provide an algorithm which computes an ε-optimal policy with probability 1 - δ where both the time spent and the number of samples taken are upper bounded by \[ O\left[\frac{|S||A|}{(1-\gamma)^3 \epsilon^2} \log\left(\frac{|S||A|}{(1-\gamma)\delta\epsilon}\right) \log\left(\frac{1}{(1-\gamma)\epsilon}\right)\right]. \] For fixed values of ε, this improves upon the previous best known bounds by a factor of (1 - γ)^{-1} and matches the sample complexity lower bounds proved in Azar et al. (2013) up to logarithmic factors. We also extend our method to computing ε-optimal policies for finite-horizon MDPs with a generative model and provide a nearly matching sample complexity lower bound.
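To make the setting concrete, the following is a minimal Python sketch of the two objects the abstract refers to: a generative sampling model and the expression inside the stated upper bound. It is not the paper's algorithm. The names `GenerativeModel` and `sample_bound` are hypothetical, the tabular layout of `transitions` is an assumption, and the constant hidden by the O-notation is set to 1 purely for illustration.

```python
import math
import random


class GenerativeModel:
    """Hypothetical sampling oracle for a tabular DMDP.

    transitions[s][a] is assumed to be a list of probabilities over
    next states; this storage layout is an illustration only.
    """

    def __init__(self, transitions, seed=0):
        self.transitions = transitions
        self.rng = random.Random(seed)

    def sample(self, state, action):
        # Draw one next state from the transition distribution of (state, action).
        probs = self.transitions[state][action]
        return self.rng.choices(range(len(probs)), weights=probs, k=1)[0]


def sample_bound(num_states, num_actions, gamma, eps, delta):
    """Evaluate the expression inside O[...] from the abstract.

    The hidden constant is taken to be 1 (an assumption), so the result
    only indicates how the bound scales, not an actual sample count.
    """
    sa = num_states * num_actions
    leading = sa / ((1 - gamma) ** 3 * eps ** 2)
    log_term_1 = math.log(sa / ((1 - gamma) * delta * eps))
    log_term_2 = math.log(1 / ((1 - gamma) * eps))
    return leading * log_term_1 * log_term_2


if __name__ == "__main__":
    # Two states, one action, deterministic swap between the states.
    model = GenerativeModel([[[0.0, 1.0]], [[1.0, 0.0]]])
    print(model.sample(0, 0))
    # The bound grows quickly as gamma approaches 1.
    print(sample_bound(10, 2, 0.9, 0.1, 0.1))
    print(sample_bound(10, 2, 0.99, 0.1, 0.1))
```

Comparing the last two calls shows the dominant (1 - γ)^{-3} dependence: moving γ from 0.9 to 0.99 increases the leading term by a factor of about 1000, before the logarithmic terms are counted.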