Stochastic Primal-Dual Methods and Sample Complexity of Reinforcement Learning

Abstract

We study the online estimation of the optimal policy of a Markov decision process (MDP). We propose a class of Stochastic Primal-Dual (SPD) methods that exploit the inherent minimax duality of Bellman equations. The SPD methods update a few coordinates of the value and policy estimates each time a new state transition is observed. These methods use little storage and have low computational complexity per iteration. The SPD methods find an absolute-$\epsilon$-optimal policy, with high probability, using $O\left(\frac{|S|^4 |A|^2 \sigma^2}{(1-\gamma)^6 \epsilon^2}\right)$ iterations/samples for the infinite-horizon discounted-reward MDP and $O\left(\frac{|S|^4 |A|^2 H^6 \sigma^2}{\epsilon^2}\right)$ iterations/samples for the finite-horizon MDP.
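
The minimax duality mentioned above can be illustrated with the standard linear-programming view of the discounted Bellman equation. The sketch below is not taken from the abstract, and its notation is assumed for illustration: $P_a$ is the transition matrix under action $a$, $r_a$ the corresponding reward vector, $\xi$ an arbitrary positive weight vector over states, and $\lambda_a$ the dual variable attached to action $a$.

```latex
% Bellman optimality equation for the discounted-reward MDP (assumed notation)
v^*(i) = \max_{a \in A} \Big\{ r_a(i) + \gamma \sum_{j \in S} P_a(i,j)\, v^*(j) \Big\},
\qquad i \in S.

% Equivalent linear program: v^* is the smallest v satisfying every Bellman inequality
\min_{v} \; \xi^\top v
\quad \text{subject to} \quad
(I - \gamma P_a)\, v \ge r_a \quad \text{for all } a \in A.

% Lagrangian saddle-point (minimax) form, with one dual vector \lambda_a \ge 0 per action
\min_{v} \; \max_{\lambda \ge 0} \;
\xi^\top v + \sum_{a \in A} \lambda_a^\top \big( r_a - (I - \gamma P_a)\, v \big).
```

In this form the primal variable $v$ plays the role of the value estimate, while the dual variable $\lambda$ can be read as a state-action occupancy from which a policy is recovered by normalizing over actions, for example $\pi(a \mid i) \propto \lambda_a(i)$. A stochastic primal-dual scheme of the kind the abstract describes would then adjust only the coordinates of $v$ and $\lambda$ touched by each sampled transition; the exact update rule is not given in the abstract and is not reproduced here.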
