Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity

Abstract

In this paper we consider the problem of learning an ε-optimal policy for a discounted Markov Decision Process (MDP). Given an MDP with S states, A actions, discount factor γ ∈ (0,1), and an approximation threshold ε > 0, we provide a model-free algorithm to learn an ε-optimal policy with sample complexity Õ(SA ln(1/p) / (ε²(1-γ)^5.5)) (where the notation Õ(·) hides poly-logarithmic factors of S, A, 1/(1-γ), and 1/ε) and success probability (1-p). For small enough ε, we show an improved algorithm with sample complexity Õ(SA ln(1/p) / (ε²(1-γ)^3)). While the first bound improves upon all known model-free algorithms and model-based ones with tight dependence on S, our second algorithm beats all known sample complexity bounds and matches the information-theoretic lower bound up to logarithmic factors.
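
To make the gap between the two bounds concrete, the following is a minimal sketch that evaluates only their leading terms, SA ln(1/p) / (ε²(1-γ)^k) with k = 5.5 and k = 3. It is an illustration under stated assumptions, not part of the paper's algorithm: constants and the poly-logarithmic factors hidden by Õ(·) are ignored, the function name `bound_leading_term` is hypothetical, and the problem sizes are made up. The second bound applies only when ε is small enough, a condition this sketch does not check.

```python
import math


def bound_leading_term(S, A, eps, gamma, p, k):
    """Leading term S*A*ln(1/p) / (eps^2 * (1-gamma)^k).

    Assumption: constants and poly-logarithmic factors are dropped,
    so the result is a scale comparison, not a sample count.
    """
    return S * A * math.log(1.0 / p) / (eps ** 2 * (1.0 - gamma) ** k)


# Hypothetical problem size, chosen only for illustration.
S, A, eps, gamma, p = 100, 10, 0.1, 0.99, 0.05

first = bound_leading_term(S, A, eps, gamma, p, 5.5)   # first algorithm
second = bound_leading_term(S, A, eps, gamma, p, 3.0)  # improved algorithm

# S, A, eps and p cancel in the ratio, leaving (1 - gamma)^(-2.5).
ratio = first / second

print(f"first bound leading term:  {first:.3e}")
print(f"second bound leading term: {second:.3e}")
print(f"ratio first/second:        {ratio:.3e}")
```

The ratio of the two leading terms is (1-γ)^(-2.5), independent of S, A, ε, and p. For γ = 0.99 this is 100^2.5 = 10^5, so the improvement matters most when the effective horizon 1/(1-γ) is long.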
