Replicable Bandits with UCB based Exploration
Abstract
We study replicable algorithms for stochastic multi-armed bandits (MAB) and linear bandits with UCB (Upper Confidence Bound) based exploration. A bandit algorithm is $\rho$-replicable if two executions using shared internal randomness but independent reward realizations produce the same action sequence with probability at least $1-\rho$. Prior approaches to this problem are elimination-based and, in linear bandits with infinitely many actions, rely on discretization, leading to suboptimal dependence on the dimension $d$ and on $\rho$. We develop optimistic alternatives for both settings. For stochastic multi-armed bandits, we propose RepUCB, a replicable batched UCB algorithm, and show that it attains a regret of $O\!\left(\frac{K^2\log^2 T}{\rho^2}\sum_{a:\Delta_a>0}\left(\Delta_a+\frac{\log(KT\log T)}{\Delta_a}\right)\right)$. For stochastic linear bandits, we first introduce RepRidge, a replicable ridge regression estimator that satisfies both a confidence guarantee and a $\rho$-replicability guarantee. Beyond its role in our bandit algorithm, this estimator may also be of independent interest in other statistical estimation settings. We then use RepRidge to design RepLinUCB, a replicable optimistic algorithm for stochastic linear bandits, and show that its regret is bounded by $O\!\left(\left(d+\frac{d^3}{\rho}\right)\sqrt{T}\right)$. This improves the best prior regret guarantee by a factor of $O(d/\rho)$, showing that our optimistic algorithm can substantially reduce the price of replicability. This is the first linear-bandit algorithm with an optimal dependence on $\rho$ for a large number of arms. Finally, we extend our framework to stochastic generalized linear bandits by developing RepGLM, a replicable penalized GLM estimator, and RepGLMUCB, a replicable optimistic algorithm for this setting.
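The definition of $\rho$-replicability above can be made concrete with a small simulation. The following is a minimal sketch, not the paper's RepUCB: the helper names (`run_once`, `empirical_replicability`), the Bernoulli reward model, and the two toy policies are all assumptions introduced for illustration. It runs a policy twice with the same shared seed and independent reward seeds, then reports how often the two action sequences coincide, which is an empirical estimate of $1-\rho$.

```python
import random

def round_robin(K, T, pull, shared_rng):
    """Toy policy that ignores rewards, so it is trivially replicable."""
    actions = []
    for t in range(T):
        a = t % K
        pull(a)
        actions.append(a)
    return actions

def explore_then_commit(K, T, pull, shared_rng, m=20):
    """Toy policy: pull each arm m times, then commit to the best empirical mean.
    It never consults shared_rng, so nothing stabilizes its commit decision."""
    sums = [0.0] * K
    actions = []
    for t in range(min(m * K, T)):
        a = t % K
        sums[a] += pull(a)
        actions.append(a)
    best = max(range(K), key=lambda a: sums[a])
    actions.extend([best] * (T - len(actions)))
    return actions

def run_once(algo, means, T, shared_seed, reward_seed):
    """One execution: shared_rng is the internal randomness, reward_rng the environment."""
    shared_rng = random.Random(shared_seed)
    reward_rng = random.Random(reward_seed)

    def pull(a):
        # Bernoulli reward with mean means[a] (an assumed reward model).
        return 1.0 if reward_rng.random() < means[a] else 0.0

    return algo(len(means), T, pull, shared_rng)

def empirical_replicability(algo, means, T, n_pairs=200):
    """Fraction of paired runs (same shared seed, independent rewards)
    that produce identical action sequences."""
    same = 0
    for i in range(n_pairs):
        first = run_once(algo, means, T, shared_seed=i, reward_seed=2 * i + 10_000)
        second = run_once(algo, means, T, shared_seed=i, reward_seed=2 * i + 10_001)
        same += (first == second)
    return same / n_pairs

if __name__ == "__main__":
    means = [0.50, 0.52]  # two nearly tied arms make the commit step unstable
    print("round robin:", empirical_replicability(round_robin, means, T=100))
    print("explore-then-commit:", empirical_replicability(explore_then_commit, means, T=100))
```

The reward-independent policy always reproduces its action sequence, whereas the naive explore-then-commit policy frequently commits to different arms across the two runs when the gap is small. A replicable algorithm would use the `shared_rng` slot to stabilize such decisions; how RepUCB does this is not specified in the abstract.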