Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP

Abstract

A fundamental question in reinforcement learning is whether model-free algorithms are sample efficient. Recently, Jin et al. (2018) proposed a Q-learning algorithm with a UCB exploration policy and proved that it achieves a nearly optimal regret bound for finite-horizon episodic MDPs. In this paper, we adapt Q-learning with a UCB exploration bonus to infinite-horizon MDPs with discounted rewards, without access to a generative model. We show that the sample complexity of exploration of our algorithm is bounded by $\tilde{O}\left(\frac{SA}{\epsilon^2 (1-\gamma)^7}\right)$. This improves on the previously best known result of $\tilde{O}\left(\frac{SA}{\epsilon^4 (1-\gamma)^8}\right)$ in this setting, achieved by delayed Q-learning (Strehl et al., 2006), and matches the lower bound in terms of $\epsilon$, $S$, and $A$ up to logarithmic factors.
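To make the setting concrete, the following is a minimal sketch of tabular Q-learning with a count-based UCB bonus run along a single trajectory of a discounted MDP. It is not the paper's exact algorithm: the step-size schedule, the bonus formula, the constant `c`, the `step` callback, and the toy environment `toy_step` are all illustrative assumptions, since the abstract does not specify them.

```python
import math
import random


def ucb_q_learning(step, n_states, n_actions, gamma, n_steps, c=1.0, seed=0):
    """Tabular Q-learning with a UCB-style bonus on one continuing trajectory.

    `step(state, action, rng)` is a hypothetical environment callback returning
    (reward in [0, 1], next_state). The agent never resets or samples arbitrary
    state-action pairs, i.e. it does not use a generative model.
    """
    rng = random.Random(seed)
    v_max = 1.0 / (1.0 - gamma)      # largest possible discounted return
    horizon = 1.0 / (1.0 - gamma)    # assumed effective horizon
    # Optimistic initialisation: every untried action looks maximally good.
    q = [[v_max] * n_actions for _ in range(n_states)]
    counts = [[0] * n_actions for _ in range(n_states)]
    state = 0
    for _ in range(n_steps):
        # Act greedily with respect to the optimistic estimates.
        action = max(range(n_actions), key=lambda a: q[state][a])
        reward, next_state = step(state, action, rng)
        counts[state][action] += 1
        k = counts[state][action]
        alpha = (horizon + 1.0) / (horizon + k)              # assumed step size
        bonus = c * v_max * math.sqrt(math.log(k + 1) / k)   # assumed bonus shape
        target = reward + bonus + gamma * max(q[next_state])
        updated = (1.0 - alpha) * q[state][action] + alpha * target
        q[state][action] = min(updated, v_max)               # clip to the valid range
        state = next_state
    return q, counts


def toy_step(state, action, rng):
    """Hypothetical two-state MDP: the action chosen becomes the next state,
    and reward 1 is paid only for taking action 1 while in state 1."""
    reward = 1.0 if (state == 1 and action == 1) else 0.0
    return reward, action


q_table, visit_counts = ucb_q_learning(toy_step, n_states=2, n_actions=2,
                                       gamma=0.5, n_steps=500)
```

The bonus shrinks as a state-action pair is visited more often, so optimism alone drives exploration and no random action selection is needed. How the bonus and step size must be tuned to obtain the stated bound is the subject of the paper and is not reproduced here.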
