Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes

Abstract

We develop several provably efficient model-free reinforcement learning (RL) algorithms for infinite-horizon average-reward Markov Decision Processes (MDPs). We consider both the online setting and the setting with access to a simulator. In the online setting, we propose model-free RL algorithms based on reference-advantage decomposition. Our algorithm achieves $\widetilde{O}(S^5 A^2 \mathrm{sp}(h^*) \sqrt{T})$ regret after $T$ steps, where $S \times A$ is the size of the state-action space and $\mathrm{sp}(h^*)$ is the span of the optimal bias function. Our results are the first to achieve optimal dependence on $T$ for weakly communicating MDPs. In the simulator setting, we propose a model-free RL algorithm that finds an $\epsilon$-optimal policy using $\widetilde{O}\left(\frac{SA\,\mathrm{sp}^2(h^*)}{\epsilon^2} + \frac{S^2 A\,\mathrm{sp}(h^*)}{\epsilon}\right)$ samples, whereas the minimax lower bound is $\Omega\left(\frac{SA\,\mathrm{sp}(h^*)}{\epsilon^2}\right)$. Our results are based on two new techniques that are unique to the average-reward setting: 1) better discounted approximation by value-difference estimation; 2) efficient construction of a confidence region for the optimal bias function with space complexity $O(SA)$.
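To make the gap between the simulator-setting upper bound and the minimax lower bound concrete, the following is a minimal Python sketch that evaluates both rates for one set of inputs. It is an illustration under stated assumptions, not the paper's algorithm: the bias vector `h`, the values of `S`, `A`, and `eps`, and the function names are all hypothetical, and constants and logarithmic factors are dropped. The span is taken as the standard seminorm, the maximum entry minus the minimum entry.

```python
import math


def span(h):
    """Span seminorm of a bias vector: max_s h(s) - min_s h(s)."""
    return max(h) - min(h)


def upper_bound_rate(S, A, sp, eps):
    """Upper-bound rate S*A*sp^2/eps^2 + S^2*A*sp/eps (constants and logs dropped)."""
    return S * A * sp**2 / eps**2 + S**2 * A * sp / eps


def lower_bound_rate(S, A, sp, eps):
    """Lower-bound rate S*A*sp/eps^2 (constants dropped)."""
    return S * A * sp / eps**2


# Hypothetical inputs, chosen only for illustration.
h = [0.0, 1.5, -0.5, 2.0]   # assumed bias values, one per state
S, A, eps = len(h), 2, 0.1  # assumed sizes and target accuracy

sp = span(h)
ratio = upper_bound_rate(S, A, sp, eps) / lower_bound_rate(S, A, sp, eps)

# Algebraically the ratio simplifies to sp + S * eps.
print(f"span = {sp}, upper/lower ratio = {ratio:.3f}")
```

Dividing the two expressions gives sp(h*) + Sε, so for small ε the upper-bound rate exceeds the lower-bound rate by roughly a factor of sp(h*).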
