Batched Stochastic Linear Bandits with 1-Bit Communication Constraints
Abstract
We study stochastic linear bandits under a natural combination of batching and communication constraints: the time horizon $T$ is partitioned into batches of equal size $B$, and during each batch the learner sends $B$ requested arm pulls to an agent, who then observes the corresponding $B$ rewards and responds with a single bit of feedback to the learner. For each batch, the learner specifies the 1-bit quantization rule the agent uses, which may depend on all previously received bits but not directly on any past rewards. This setting addresses a significant yet unexplored "middle ground" between earlier models that consider only per-round quantization or only a total bit budget. We establish a minimax lower bound showing that $\Omega(B \min\{d, |\mathcal{A}|\})$ regret is unavoidable due to the 1-bit communication bottleneck, even in the absence of noise. Combined with standard statistical limits, this yields a general lower bound of $\Omega\big(B \min\{d, |\mathcal{A}|\} + \sqrt{dT \min\{d, \log|\mathcal{A}|\}}\big)$. We develop two phased-elimination algorithms based on G-optimal designs and 1-bit mean estimation. The first achieves $\tilde{O}(dB + d\sqrt{T})$ regret, matching the lower bound up to logarithmic factors when $|\mathcal{A}| = \exp(\Omega(d))$. The second incorporates a safe-arm identification and warm-start procedure to obtain $\tilde{O}\big(B|\mathcal{A}| + d^{3/2}B + \sqrt{dT \log|\mathcal{A}|}\big)$ regret, which is near-optimal in broad scaling regimes of $(|\mathcal{A}|, B, d, T)$. Together, our results demonstrate that a single bit of feedback per batch suffices to nearly match the minimax regret of unconstrained linear bandits in these regimes, even for batch sizes as large as $\Theta(\sqrt{T})$.
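To make the interaction protocol concrete, the following is a minimal sketch of one way a learner could estimate an arm's mean reward from one bit per batch. It is an illustration under stated assumptions, not the paper's algorithm: the threshold-on-batch-mean quantizer, the bisection schedule, the assumption that mean rewards lie in $[-1, 1]$, and the names `agent_respond` and `estimate_mean_reward` are all hypothetical. Plain bisection is also not robust to noisy bits, so the sketch is meaningful mainly when the batch mean concentrates well.

```python
import numpy as np


def agent_respond(theta, arms, quantizer, rng, noise_std=0.1):
    """Agent side: pull the requested arms, observe rewards, return one bit.

    `quantizer` is the 1-bit rule chosen by the learner for this batch
    (hypothetical interface: maps the reward vector to True/False).
    """
    rewards = arms @ theta + noise_std * rng.standard_normal(len(arms))
    return int(quantizer(rewards))


def estimate_mean_reward(theta, arm, batch_size, num_batches, rng, noise_std=0.1):
    """Learner side: bisect on a threshold, using one bit per batch.

    Assumes (for illustration only) that the mean reward lies in [-1, 1].
    The learner never sees rewards; it only sees the returned bits.
    """
    lo, hi = -1.0, 1.0
    arms = np.tile(arm, (batch_size, 1))  # request the same arm B times
    for _ in range(num_batches):
        mid = 0.5 * (lo + hi)
        # The rule depends only on earlier bits (through lo and hi).
        rule = lambda r, t=mid: r.mean() > t
        bit = agent_respond(theta, arms, rule, rng, noise_std)
        if bit:
            lo = mid  # batch mean was above the threshold
        else:
            hi = mid  # batch mean was at or below the threshold
    return 0.5 * (lo + hi)


rng = np.random.default_rng(0)
theta = np.array([0.3, -0.2])  # unknown parameter, visible only to the agent
arm = np.array([1.0, 0.0])     # true mean reward of this arm is 0.3
estimate = estimate_mean_reward(theta, arm, batch_size=50, num_batches=20,
                                rng=rng, noise_std=0.0)
```

In this noiseless run each batch halves the interval containing the arm's mean reward, so the error after $k$ batches is at most $2^{-k}$. Every one of those batches still costs $B$ pulls, which gives some intuition for why a regret term scaling with $B$ appears in the bounds above.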