On the Power of Adaptivity for ε-Best Arm Identification in Linear Bandits

Abstract

We study the minimax sample complexity of $\varepsilon$-best arm identification in linear bandits. Given a compact action set $X$ that spans $\mathbb{R}^d$ and an unknown reward vector $\theta \in \mathbb{R}^d$, the goal is to output an arm $\hat{x} \in X$ such that $\langle \hat{x}, \theta \rangle \ge \max_{x \in X} \langle x, \theta \rangle - \varepsilon$ with probability at least $1-\delta$, using as few samples as possible. First, we present a non-adaptive fixed-design method with sample complexity $O\!\left(\frac{d \log(1/\delta)}{\varepsilon^2} + \frac{w(X)^2}{\varepsilon^2}\right)$, where $w(X)$ is a Gaussian width term that depends on $X$, and we prove a matching lower bound $\Omega\!\left(\frac{d \log(1/\delta)}{\varepsilon^2} + \frac{w(X)^2}{\varepsilon^2}\right)$ for all non-adaptive fixed-design methods. We then turn to adaptive sampling. We raise an important structural question: beyond the canonical basis, are there structured action sets for which adaptivity yields only logarithmic-factor improvements over the optimal non-adaptive rate? We answer in the affirmative for several natural action sets, namely the hypercube, the $\ell_2$ ball, $m$-sets, and multi-task multi-armed bandits. Finally, we provide the first construction of an action set $X$ for which adaptivity yields a polynomial-factor improvement over every non-adaptive algorithm. A key ingredient behind this separation is an $\ell_2$-norm estimation subroutine: we design an adaptive algorithm that uses $O\!\left(\frac{d \log(1/\delta)}{\varepsilon^2}\right)$ samples from the unit $\ell_2$ ball in $\mathbb{R}^d$ and outputs an estimate $\hat{r}$ satisfying $|\hat{r} - \|\theta\|_2| \le \varepsilon$ with probability at least $1-\delta$, where $\theta$ is the unknown reward vector.
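To make the non-adaptive setting concrete, the following is a minimal illustrative sketch and not the method of the paper. It assumes the action set is the vertex set of the hypercube {-1,+1}^d, Gaussian reward noise, and a simple moment estimator; the names `fixed_design_hypercube` and `suboptimality`, and all parameter values, are hypothetical. Arms are drawn without looking at past rewards, the reward vector is estimated once at the end, and the arm maximizing the estimated reward is returned.

```python
import random


def fixed_design_hypercube(theta, n_samples, noise_std=1.0, seed=0):
    """Non-adaptive sketch: sample arms uniformly from {-1,+1}^d, then estimate theta."""
    rng = random.Random(seed)
    d = len(theta)
    theta_hat = [0.0] * d
    for _ in range(n_samples):
        # The arm is chosen independently of all past rewards (non-adaptive).
        x = [rng.choice((-1.0, 1.0)) for _ in range(d)]
        # Noisy linear reward: <x, theta> plus Gaussian noise (an assumption).
        y = sum(xi * ti for xi, ti in zip(x, theta)) + rng.gauss(0.0, noise_std)
        # For uniform random signs E[x x^T] = I, so y * x is an unbiased
        # estimate of theta; average it over all samples.
        for i in range(d):
            theta_hat[i] += y * x[i] / n_samples
    # On the hypercube, the arm maximizing <x, theta_hat> is the sign vector.
    x_hat = [1.0 if t >= 0 else -1.0 for t in theta_hat]
    return x_hat, theta_hat


def suboptimality(x_hat, theta):
    """Gap between the best hypercube arm and the returned arm."""
    best = sum(abs(t) for t in theta)  # max over {-1,+1}^d of <x, theta>
    return best - sum(xi * ti for xi, ti in zip(x_hat, theta))


theta = [0.5, -0.3, 0.2, -0.1, 0.4]  # hypothetical reward vector
x_hat, theta_hat = fixed_design_hypercube(theta, n_samples=20000)
gap = suboptimality(x_hat, theta)
print("returned arm:", x_hat)
print("suboptimality gap:", gap)
```

The returned arm is ε-best whenever the printed gap is at most ε. An adaptive method would instead be allowed to choose each arm based on the rewards observed so far, which is the distinction the abstract studies.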
