Unified Framework of Distributional Regret in Multi-Armed Bandits and Reinforcement Learning
Abstract
We study the distribution of regret in stochastic multi-armed bandits and episodic reinforcement learning through a unified framework. We formalize a distributional regret bound as a probabilistic guarantee that holds uniformly over all confidence levels $\delta \in (0,1]$, thereby characterizing the regret distribution across the full range of $\delta$. We present a simple UCBVI-style algorithm whose exploration bonus is determined by the ratios $c_{1,k}/N$ and $c_{2,k}/N$, where $N$ denotes the visit count and $(c_{1,k}, c_{2,k})$ are user-specified parameters. For arbitrary parameter sequences, we derive general gap-independent and gap-dependent distributional regret bounds, yielding a principled characterization of how the parameters control the trade-off between expected performance, tail risk, and instance-dependent behavior. In particular, our bounds achieve optimal trade-offs between expected and distributional regret in both minimax and instance-dependent regimes. As a special case, for multi-armed bandits with $A$ arms and horizon $T$, we obtain a distributional regret bound of order $O(\sqrt{AT}\,\log(1/\delta))$, confirming the conjecture of Lattimore & Szepesvári (2020, Section 17.1) for the first time.
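To make the notion of a guarantee that holds across all confidence levels $\delta$ concrete, the following is a minimal simulation sketch, not the algorithm from the paper. It runs a generic UCB-style index policy on a Bernoulli bandit many times and reports the empirical $(1-\delta)$-quantile of regret for several $\delta$. The bonus form `sqrt(c / N)`, the arm means, the horizon, and the helper names `run_ucb` and `regret_quantiles` are all assumptions chosen for illustration.

```python
import math
import random


def run_ucb(means, horizon, c, rng):
    """Run one UCB-style bandit episode and return its pseudo-regret."""
    n_arms = len(means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    best = max(means)
    regret = 0.0
    for t in range(horizon):
        if t < n_arms:
            arm = t  # pull every arm once so all counts are positive
        else:
            # Assumed bonus sqrt(c / N); this is NOT the paper's exact bonus.
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a] + math.sqrt(c / counts[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret


def regret_quantiles(regrets, deltas):
    """Empirical (1 - delta)-quantile of regret for each delta."""
    ordered = sorted(regrets)
    n = len(ordered)
    out = {}
    for delta in deltas:
        # Smallest order statistic covering at least a (1 - delta) fraction of runs.
        idx = min(n - 1, max(0, math.ceil((1.0 - delta) * n) - 1))
        out[delta] = ordered[idx]
    return out


rng = random.Random(0)
regrets = [run_ucb([0.5, 0.6, 0.7], horizon=500, c=0.5, rng=rng) for _ in range(200)]
quantiles = regret_quantiles(regrets, deltas=[0.5, 0.1, 0.01])
for delta, q in quantiles.items():
    print(f"delta={delta}: regret <= {q:.1f} in a fraction >= {1 - delta:.2f} of runs")
```

Rerunning the sketch with different values of the hypothetical constant `c` gives a rough feel for the trade-off the abstract describes: how the typical regret and the upper tail of the regret distribution shift as the bonus scale changes.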