On the Sample Complexity of Reinforcement Learning with a Generative Model

Abstract

We consider the problem of learning the optimal action-value function in the discounted-reward Markov decision processes (MDPs). We prove a new PAC bound on the sample-complexity of model-based value iteration algorithm in the presence of the generative model, which indicates that for an MDP with N state-action pairs and the discount factor γ∈[0,1) only O(N(N/δ)/((1-γ)3ε2)) samples are required to find an ε-optimal estimation of the action-value function with the probability 1-δ. We also prove a matching lower bound of (N(N/δ)/((1-γ)3ε2)) on the sample complexity of estimating the optimal action-value function by every RL algorithm. To the best of our knowledge, this is the first matching result on the sample complexity of estimating the optimal (action-) value function in which the upper bound matches the lower bound of RL in terms of N, ε, δ and 1/(1-γ). Also, both our lower bound and our upper bound significantly improve on the state-of-the-art in terms of 1/(1-γ).

0

Turn this paper into a lesson

ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…