Batched Stochastic Linear Bandits with 1-Bit Communication Constraints

Abstract

We study stochastic linear bandits under a natural combination of batching and communication constraints: the time horizon is partitioned into batches of equal size $B$, and during each batch the learner sends $B$ requested arm pulls to an agent, who then observes the corresponding $B$ rewards and responds with a single bit of feedback to the learner. For each batch, the learner specifies the 1-bit quantization rule the agent uses, which may depend on all previously received bits but not on any past rewards directly. This setting addresses a significant yet unexplored "middle ground" between previous models having per-round quantization only or total bit budgets only. We establish a minimax lower bound showing that $\Omega(B \min\{d, \log|\mathcal{A}|\})$ regret is unavoidable due to the 1-bit communication bottleneck, even in the absence of noise. Combined with standard statistical limits, this yields a general lower bound of $\Omega(B \min\{d, \log|\mathcal{A}|\} + \sqrt{dT \min\{d, \log|\mathcal{A}|\}})$. We develop two phased-elimination algorithms based on G-optimal designs and 1-bit mean estimation. The first achieves $\widetilde{O}(dB + d\sqrt{T})$ regret, matching the lower bound up to logarithmic factors when $|\mathcal{A}| = \exp(\Omega(d))$, and the second incorporates a safe-arm identification and warm-start procedure to obtain $\widetilde{O}(B \log|\mathcal{A}| + d^{3/2}B + \sqrt{dT \log|\mathcal{A}|})$ regret, which is near-optimal in broad scaling regimes of $(|\mathcal{A}|, B, d, T)$. Together, our results demonstrate that a single bit of feedback per batch suffices to nearly match the minimax regret of unconstrained linear bandits in broad scaling regimes, even for batch sizes as large as $\Theta(\sqrt{T})$.
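
To make the interaction protocol concrete, the following is a minimal sketch of 1-bit mean estimation for a single arm. It is an illustration under stated assumptions and not the algorithm from the paper: the threshold rule, the bisection strategy, the reward range [0, 1], and the names `agent_bit`, `estimate_mean_one_bit`, and `pull` are all hypothetical. The point it shows is that the quantization rule used in each batch depends only on previously received bits, never on past rewards.

```python
def agent_bit(rewards, threshold):
    """Agent side: compress a whole batch of rewards into a single bit.

    Assumed rule: report 1 if the batch average is at least the threshold.
    """
    return 1 if sum(rewards) / len(rewards) >= threshold else 0


def estimate_mean_one_bit(pull, batch_size, num_batches, lo=0.0, hi=1.0):
    """Learner side: bisect on the threshold, receiving one bit per batch.

    `pull` is a hypothetical zero-argument function returning one reward.
    The interval [lo, hi] is determined entirely by the bits received so
    far, so each new threshold is a function of past bits only.
    """
    for _ in range(num_batches):
        threshold = (lo + hi) / 2
        # The rewards are observed by the agent only, not by the learner.
        rewards = [pull() for _ in range(batch_size)]
        if agent_bit(rewards, threshold):
            lo = threshold  # batch average was at or above the threshold
        else:
            hi = threshold  # batch average was below the threshold
    return (lo + hi) / 2


# Noiseless example: an arm whose reward is always 0.3.
estimate = estimate_mean_one_bit(lambda: 0.3, batch_size=8, num_batches=20)
```

In this noiseless example each batch halves the learner's uncertainty interval, so reaching a given precision takes a number of batches logarithmic in that precision regardless of the batch size. This gives some intuition for why a cost proportional to $B$ can appear even without noise.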
