Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability
Abstract
Kullback-Leibler (KL) regularization is ubiquitous in reinforcement learning algorithms, in the form of either reverse or forward KL. Recent studies have demonstrated ε⁻¹-type fast rates for decision making under reverse-KL regularization, in contrast to the standard ε⁻²-type sample complexity. However, for forward-KL-regularized objectives, existing statistical analyses are either not applicable or result in O(ε⁻²) slow rates. We take the first step towards addressing this problem via a streamlined analysis of forward-KL-regularized offline contextual bandits (CBs). We give the first O(ε⁻¹) upper bounds in the tabular and general function approximation settings, both under notions of single-policy concentrability. In particular, our convex-analytical pipeline unifies these settings by exploiting the pessimism principle in a novel way and completely bypasses the proof routines in previous works based on the mean value theorem, which might be of independent interest. Moreover, we provide rate-optimal lower bounds, demonstrating the tightness of our upper bounds in terms of statistical rates. Our lower bounds also show that the forward-KL-regularized sample complexity recovers the unregularized slow rate in the low-regularization regime, as is the case for reverse-KL regularization.
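To make the reverse/forward distinction concrete, the following minimal sketch computes both divergences for a single context with three actions and plugs them into a KL-regularized value. It assumes the common convention that reverse KL is KL(π ‖ π_ref) and forward KL is KL(π_ref ‖ π); the abstract does not spell out the paper's convention. The names `pi_ref`, `rewards`, `beta`, and `regularized_value` are hypothetical and chosen only for illustration; this is not the paper's algorithm.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def regularized_value(pi, rewards, pi_ref, beta, direction):
    """Expected reward minus beta times a KL penalty (illustrative objective).

    direction="reverse" penalizes KL(pi || pi_ref);
    direction="forward" penalizes KL(pi_ref || pi).
    This naming convention is an assumption, not taken from the abstract.
    """
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    if direction == "reverse":
        penalty = kl(pi, pi_ref)
    elif direction == "forward":
        penalty = kl(pi_ref, pi)
    else:
        raise ValueError("direction must be 'reverse' or 'forward'")
    return expected_reward - beta * penalty


# Toy numbers, made up for illustration only.
pi_ref = [0.5, 0.3, 0.2]   # reference policy over three actions
pi = [0.8, 0.1, 0.1]       # candidate policy
rewards = [1.0, 0.2, 0.0]  # per-action rewards
beta = 0.5                 # regularization strength

reverse_kl = kl(pi, pi_ref)
forward_kl = kl(pi_ref, pi)
value_reverse = regularized_value(pi, rewards, pi_ref, beta, "reverse")
value_forward = regularized_value(pi, rewards, pi_ref, beta, "forward")
```

Because KL divergence is asymmetric, the two penalties generally differ for the same pair of policies, so the two regularized objectives can rank policies differently.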