Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds
Abstract
While numerous works have focused on devising efficient algorithms for reinforcement learning (RL) with uniformly bounded rewards, it remains an open question whether sample- or time-efficient algorithms for RL with large state-action spaces exist when the rewards are heavy-tailed, i.e., with only finite $(1+\epsilon)$-th moments for some $\epsilon \in (0,1]$. In this work, we address the challenge of such rewards in RL with linear function approximation. We first design an algorithm, Heavy-OFUL, for heavy-tailed linear bandits, achieving an instance-dependent $T$-round regret of $\tilde{O}\big(d T^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^{T} \nu_t^2} + d T^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$, the first of this kind. Here, $d$ is the feature dimension, and $\nu_t^{1+\epsilon}$ is the $(1+\epsilon)$-th central moment of the reward at the $t$-th round. We further show that the above bound is minimax optimal when applied to the worst-case instances in stochastic and deterministic linear bandits. We then extend this algorithm to the RL setting with linear function approximation. Our algorithm, termed Heavy-LSVI-UCB, achieves the first computationally efficient instance-dependent $K$-episode regret of $\tilde{O}\big(d \sqrt{H \mathcal{U}^*} K^{\frac{1}{1+\epsilon}} + d \sqrt{H \mathcal{V}^* K}\big)$. Here, $H$ is the length of the episode, and $\mathcal{U}^*$, $\mathcal{V}^*$ are instance-dependent quantities scaling with the central moments of the reward and value functions, respectively. We also provide a matching minimax lower bound $\Omega\big(d H K^{\frac{1}{1+\epsilon}} + d \sqrt{H^3 K}\big)$ to demonstrate the optimality of our algorithm in the worst case. Our result is achieved via a novel robust self-normalized concentration inequality that may be of independent interest in handling heavy-tailed noise in general online regression problems.
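As a rough numerical illustration of how the linear bandit bound behaves, the sketch below evaluates its two leading terms with constants and logarithmic factors dropped. It writes nu_t^(1+eps) for the (1+eps)-th central moment of the reward at round t, as in the abstract. The function names `heavy_oful_bound` and `constant_noise_rate` are hypothetical and chosen for this example only; this is arithmetic on the stated bound, not an implementation of Heavy-OFUL.

```python
import math


def heavy_oful_bound(d, eps, nus):
    """Leading terms of the instance-dependent regret bound.

    Constants and log factors are dropped (an assumption of this sketch).
    d   : feature dimension
    eps : moment parameter in (0, 1]
    nus : per-round values nu_t, one entry per round, so T = len(nus)
    """
    T = len(nus)
    # Common factor T^{(1 - eps) / (2 (1 + eps))}.
    scale = T ** ((1 - eps) / (2 * (1 + eps)))
    # First term: d * scale * sqrt(sum_t nu_t^2); second term: d * scale.
    return d * scale * math.sqrt(sum(nu * nu for nu in nus)) + d * scale


def constant_noise_rate(d, eps, nu, T):
    """d * nu * T^{1 / (1 + eps)}: what the first term reduces to when nu_t = nu."""
    return d * nu * T ** (1 / (1 + eps))


if __name__ == "__main__":
    d, eps, T = 10, 0.5, 10_000
    # Constant noise level: the first term collapses to d * nu * T^{1/(1+eps)}.
    print(heavy_oful_bound(d, eps, [1.0] * T), constant_noise_rate(d, eps, 1.0, T))
    # Noise-free rewards (nu_t = 0): only the second term d * T^{(1-eps)/(2(1+eps))} remains.
    print(heavy_oful_bound(d, eps, [0.0] * T))
```

With a constant noise level nu_t = nu, the exponents combine as (1-eps)/(2(1+eps)) + 1/2 = 1/(1+eps), so the first term scales as d nu T^(1/(1+eps)). With nu_t = 0 only the second term survives, and for eps = 1 it is simply d, independent of T.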