Improved Regret Analysis for Variance-Adaptive Linear Bandits and Horizon-Free Linear Mixture MDPs

Abstract

In online learning problems, exploiting low variance plays an important role in obtaining tight performance guarantees, yet it is challenging because variances are often not known a priori. Recently, considerable progress has been made by Zhang et al. (2021), who obtain a variance-adaptive regret bound for linear bandits without knowledge of the variances and a horizon-free regret bound for linear mixture Markov decision processes (MDPs). In this paper, we present novel analyses that improve their regret bounds significantly. For linear bandits, we achieve $\tilde{O}(\min\{d\sqrt{K},\, d^{1.5}\sqrt{\sum_{k=1}^{K} \sigma_k^2}\} + d^2)$, where $d$ is the dimension of the features, $K$ is the time horizon, $\sigma_k^2$ is the noise variance at time step $k$, and $\tilde{O}$ ignores polylogarithmic dependence; this is a factor of $d^3$ improvement. For linear mixture MDPs, under the assumption that the maximum cumulative reward in an episode lies in $[0,1]$, we achieve a horizon-free regret bound of $\tilde{O}(d\sqrt{K} + d^2)$, where $d$ is the number of base models and $K$ is the number of episodes. This is a factor of $d^{3.5}$ improvement in the leading term and $d^7$ in the lower-order term. Our analysis critically relies on a novel peeling-based regret analysis that leverages the elliptical potential 'count' lemma.
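
To make the shape of the two bounds concrete, the following minimal sketch evaluates them numerically. It is only an illustration: constants and polylogarithmic factors are dropped, and the function names and the sample values of d, K, and the variances are assumptions chosen for this example, not quantities from the paper.

```python
import math


def linear_bandit_bound(d, variances):
    """Evaluate min{d*sqrt(K), d^1.5*sqrt(sum_k sigma_k^2)} + d^2.

    Constants and polylogarithmic factors are dropped (illustrative only).
    `variances` holds the per-step noise variances sigma_k^2, so K = len(variances).
    """
    K = len(variances)
    worst_case_term = d * math.sqrt(K)
    variance_term = d ** 1.5 * math.sqrt(math.fsum(variances))
    return min(worst_case_term, variance_term) + d ** 2


def mixture_mdp_bound(d, K):
    """Evaluate d*sqrt(K) + d^2, again without constants or log factors."""
    return d * math.sqrt(K) + d ** 2


# Hypothetical problem sizes, chosen only for illustration.
d, K = 10, 100_000

noisy = linear_bandit_bound(d, [1.0] * K)      # every sigma_k^2 = 1
quiet = linear_bandit_bound(d, [1e-4] * K)     # every sigma_k^2 = 1e-4
noiseless = linear_bandit_bound(d, [0.0] * K)  # deterministic rewards

print(noisy, quiet, noiseless, mixture_mdp_bound(d, K))
```

With unit variances the worst-case term $d\sqrt{K}$ attains the minimum, while with small variances the variance-dependent term takes over; in the noiseless case only the $d^2$ term remains.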
