Asymptotically Optimal Problem-Dependent Bandit Policies for Transfer Learning
Abstract
We study the non-contextual multi-armed bandit problem in a transfer learning setting: before any pulls, the learner is given $N'_k$ i.i.d. samples from each source distribution $\nu'_k$, and the true target distributions $\nu_k$ lie within a known distance bound $d_k(\nu_k, \nu'_k) \le L_k$. In this framework, we first derive a problem-dependent asymptotic lower bound on cumulative regret that extends the classical Lai-Robbins result to incorporate the transfer parameters $(d_k, L_k, N'_k)$. We then propose KL-UCB-Transfer, a simple index policy that matches this new bound in the Gaussian case. Finally, we validate our approach via simulations, showing that KL-UCB-Transfer significantly outperforms the no-prior baseline when source and target distributions are sufficiently close.
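For orientation, the classical Lai-Robbins result that the abstract refers to can be made concrete in the Gaussian case. The sketch below is not the paper's transfer-aware bound, which the abstract does not state. It only computes the classical no-transfer constant $C = \sum_{k:\Delta_k>0} 2\sigma^2/\Delta_k$, assuming Gaussian arms with a common known variance $\sigma^2$, so that asymptotically regret grows at least like $C \log T$. The function name `lai_robbins_constant` and the example means are hypothetical, chosen for illustration.

```python
from typing import Sequence


def lai_robbins_constant(means: Sequence[float], sigma: float = 1.0) -> float:
    """Classical Lai-Robbins constant for Gaussian arms with known variance.

    Assumption (not from the paper): every arm is N(mu_k, sigma^2). Then
    KL(nu_k, nu*) = gap_k^2 / (2 sigma^2), so each suboptimal arm contributes
    gap_k / KL = 2 sigma^2 / gap_k to the constant in front of log T.
    """
    best = max(means)
    total = 0.0
    for mu in means:
        gap = best - mu
        if gap > 0:  # optimal arms (gap == 0) incur no regret
            total += 2.0 * sigma**2 / gap
    return total


if __name__ == "__main__":
    # Hypothetical three-arm instance with gaps 0.5 and 1.0.
    print(lai_robbins_constant([1.0, 0.5, 0.0]))  # 6.0
```

According to the abstract, the paper's lower bound extends this quantity to account for the transfer parameters; the constant above is the no-transfer reference point for that extension.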