Sharp Gap-Dependent Variance-Aware Regret Bounds for Tabular MDPs

Abstract

We consider gap-dependent regret bounds for episodic MDPs. We show that the Monotonic Value Propagation (MVP) algorithm achieves a variance-aware gap-dependent regret bound of $\tilde{O}\left(\left(\sum_{\Delta_h(s,a)>0} \frac{H^2 \log K \wedge \mathtt{Var}_{\max}^{\mathrm{c}}}{\Delta_h(s,a)} + \sum_{\Delta_h(s,a)=0} \frac{H^2 \wedge \mathtt{Var}_{\max}^{\mathrm{c}}}{\Delta_{\min}} + SAH^4 (S \vee H)\right) \log K\right)$, where $H$ is the planning horizon, $S$ is the number of states, $A$ is the number of actions, $K$ is the number of episodes, and $\wedge$ and $\vee$ denote minimum and maximum. Here, $\Delta_h(s,a) = V_h^*(s) - Q_h^*(s,a)$ is the suboptimality gap and $\Delta_{\min} := \min_{\Delta_h(s,a)>0} \Delta_h(s,a)$. The term $\mathtt{Var}_{\max}^{\mathrm{c}}$ denotes the maximum conditional total variance: the maximum, over all $(\pi, h, s)$ tuples, of the expected total variance under policy $\pi$ conditioned on trajectories visiting state $s$ at step $h$. It characterizes the maximum randomness encountered when learning any $(h, s)$ pair. Our result stems from a novel analysis of the weighted sum of the suboptimality gaps and can potentially be adapted to other algorithms. To complement the study, we establish a lower bound of $\Omega\left(\sum_{\Delta_h(s,a)>0} \frac{H^2 \wedge \mathtt{Var}_{\max}^{\mathrm{c}}}{\Delta_h(s,a)} \cdot \log K\right)$, demonstrating that the dependence on $\mathtt{Var}_{\max}^{\mathrm{c}}$ is necessary even when the maximum unconditional total variance (without conditioning on $(h, s)$) approaches zero.
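
To make the suboptimality gap and the minimum positive gap concrete, the following is a minimal sketch that computes them by backward induction on a known finite-horizon tabular MDP. It is not the MVP algorithm and does not estimate the conditional total variance; the function names (`optimal_values`, `suboptimality_gaps`, `toy_mdp`) and the two-state example are hypothetical and chosen only for illustration.

```python
import numpy as np


def optimal_values(P, R):
    """Backward induction for a known finite-horizon tabular MDP.

    P: array of shape (H, S, A, S), P[h, s, a, s2] = transition probability.
    R: array of shape (H, S, A), expected reward at step h.
    Returns Q of shape (H, S, A) and V of shape (H + 1, S), with V[H] = 0.
    """
    H, S, A = R.shape
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        # Expected next-step value: (S, A, S) @ (S,) -> (S, A)
        Q[h] = R[h] + P[h] @ V[h + 1]
        V[h] = Q[h].max(axis=1)
    return Q, V


def suboptimality_gaps(Q, V, tol=1e-12):
    """Gap of (h, s, a) is V*_h(s) - Q*_h(s, a); also return the min positive gap."""
    gaps = V[:-1, :, None] - Q          # broadcast V over the action axis
    positive = gaps[gaps > tol]
    delta_min = float(positive.min()) if positive.size else None
    return gaps, delta_min


def toy_mdp(H=2):
    """Hypothetical 2-state, 2-action MDP with uniform transitions."""
    S, A = 2, 2
    P = np.full((H, S, A, S), 0.5)      # every (s, a) moves to either state w.p. 1/2
    R_step = np.array([[1.0, 0.5],      # state 0: action 0 is better by 0.5
                       [0.0, 0.0]])     # state 1: both actions tie
    R = np.tile(R_step, (H, 1, 1))
    return P, R


if __name__ == "__main__":
    P, R = toy_mdp()
    Q, V = optimal_values(P, R)
    gaps, delta_min = suboptimality_gaps(Q, V)
    print("gaps:\n", gaps)
    print("minimum positive gap:", delta_min)
```

In this toy instance, tied actions have gap zero, so they would fall into the second sum of the upper bound, which is scaled by the minimum positive gap rather than by their own gap.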
