Horizon-Free Reinforcement Learning in Polynomial Time: the Power of Stationary Policies

Abstract

This paper gives the first polynomial-time algorithm for tabular Markov Decision Processes (MDPs) that enjoys a regret bound independent of the planning horizon. Specifically, we consider a tabular MDP with S states, A actions, a planning horizon H, and total reward bounded by 1, in which the agent plays for K episodes. We design an algorithm that achieves O(poly(S, A, log K) · √K) regret, in contrast to existing bounds, which either carry an additional polylog(H) dependency [zhang2020reinforcement] or depend exponentially on S [li2021settling]. Our result relies on a sequence of new structural lemmas establishing the approximation power, stability, and concentration properties of stationary policies, which may have applications in other problems related to Markov chains.
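
To make the setting concrete, the following is a minimal sketch of the objects the abstract refers to: a tabular MDP, a stationary policy (one action per state, reused at every step), and the finite-horizon value used to measure regret. It is not the paper's algorithm. The function names, the data layout (`P[s][a][s2]` for transition probabilities, `R[s][a]` for expected rewards), and the two-state toy MDP are all assumptions made for illustration; per-step rewards are scaled by 1/H so that the total reward stays within 1.

```python
def evaluate_stationary_policy(P, R, policy, H):
    """Expected total reward over H steps when the same action map is used at every step.

    P[s][a][s2]: transition probability, R[s][a]: expected reward,
    policy[s]: action chosen in state s (hypothetical layout, not from the paper).
    """
    S = len(P)
    V = [0.0] * S  # value with zero steps remaining
    for _ in range(H):
        V = [
            R[s][policy[s]] + sum(P[s][policy[s]][s2] * V[s2] for s2 in range(S))
            for s in range(S)
        ]
    return V


def optimal_value(P, R, H):
    """Optimal H-step value by backward induction (the maximizing action may vary by step)."""
    S, A = len(P), len(P[0])
    V = [0.0] * S
    for _ in range(H):
        V = [
            max(R[s][a] + sum(P[s][a][s2] * V[s2] for s2 in range(S)) for a in range(A))
            for s in range(S)
        ]
    return V


# Toy MDP (illustrative only): state 1 pays 1/H per step and is absorbing;
# in state 0, action 0 stays put and action 1 moves to state 1, both with reward 0.
H = 4
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [0.0, 1.0]],  # state 1: absorbing under both actions
]
R = [
    [0.0, 0.0],
    [1.0 / H, 1.0 / H],
]

v_star = optimal_value(P, R, H)
v_good = evaluate_stationary_policy(P, R, [1, 0], H)  # always move toward state 1
v_bad = evaluate_stationary_policy(P, R, [0, 0], H)   # never leave state 0

# Per-episode regret contribution of each policy when episodes start in state 0.
gap_good = v_star[0] - v_good[0]
gap_bad = v_star[0] - v_bad[0]
```

Regret over K episodes is the sum of such per-episode gaps. In this toy instance the stationary policy `[1, 0]` happens to attain the optimal value; that is a property of the example, not a general guarantee.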
