Provably Efficient Infinite-Horizon Average-Reward Reinforcement Learning with Linear Function Approximation

Abstract

This paper proposes a computationally tractable algorithm for learning infinite-horizon average-reward linear Markov decision processes (MDPs) and linear mixture MDPs under the Bellman optimality condition. While guaranteeing computational efficiency, our algorithm for linear MDPs achieves the best-known regret upper bound of O(d3/2sp(v*)T) over T time steps where sp(v*) is the span of the optimal bias function v* and d is the dimension of the feature mapping. For linear mixture MDPs, our algorithm attains a regret bound of O(d·sp(v*)T). The algorithm applies novel techniques to control the covering number of the value function class and the span of optimistic estimators of the value function, which is of independent interest.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…