A Time and Space Efficient Algorithm for Contextual Linear Bandits
Abstract
We consider a multi-armed bandit problem where payoffs are a linear function of an observed stochastic contextual variable. In the scenario where there exists a gap between optimal and suboptimal rewards, several algorithms have been proposed that achieve O(log T) regret after T time steps. However, the proposed methods either have a computational complexity per iteration that scales linearly with T or achieve regret that grows linearly with the number of contexts |X|. We propose an ε-greedy type of algorithm that overcomes both limitations. In particular, when contexts are variables in ℝ^d, we prove that our algorithm has a constant computational complexity per iteration of O(poly(d)) and can achieve a regret of O(poly(d) log T) even when |X| = Ω(2^d). In addition, unlike previous algorithms, its space complexity scales like O(Kd^2) and does not grow with T.
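To make the complexity claims concrete, the following is a minimal sketch of a generic ε-greedy linear bandit, not the algorithm analyzed in the paper. It assumes K arms, each with its own unknown parameter vector estimated by ridge regression, and an exploration probability that decays as min(1, c/t). The class name, the `explore_rate` constant c, the `ridge` regularizer, and the choice to update on every round are all illustrative assumptions. The point it illustrates is that each arm only needs a d × d inverse matrix and a d-vector, so storage is O(Kd^2) and the per-round cost is polynomial in d, independent of T.

```python
import random


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


class EpsilonGreedyLinearBandit:
    """Hypothetical sketch: per-arm ridge regression with epsilon-greedy play."""

    def __init__(self, n_arms, dim, explore_rate=5.0, ridge=1.0, seed=0):
        self.n_arms, self.dim = n_arms, dim
        self.explore_rate = explore_rate  # assumed schedule constant c
        self.rng = random.Random(seed)
        self.t = 0
        # Per arm: inverse of (ridge * I + sum of x x^T), a d x d matrix.
        self.a_inv = [
            [[(1.0 / ridge if i == j else 0.0) for j in range(dim)]
             for i in range(dim)]
            for _ in range(n_arms)
        ]
        # Per arm: running sum of reward * x, a d-vector.
        self.b = [[0.0] * dim for _ in range(n_arms)]

    def estimate(self, arm):
        """Ridge estimate of the arm's parameter vector: a_inv @ b."""
        return [dot(row, self.b[arm]) for row in self.a_inv[arm]]

    def select(self, x):
        """Pick an arm for context x: explore w.p. min(1, c/t), else be greedy."""
        self.t += 1
        epsilon = min(1.0, self.explore_rate / self.t)
        if self.rng.random() < epsilon:
            return self.rng.randrange(self.n_arms)
        scores = [dot(self.estimate(a), x) for a in range(self.n_arms)]
        return max(range(self.n_arms), key=lambda a: scores[a])

    def update(self, arm, x, reward):
        """Rank-one Sherman-Morrison update of the arm's statistics, O(d^2)."""
        a_inv = self.a_inv[arm]
        ax = [dot(row, x) for row in a_inv]  # a_inv @ x (a_inv is symmetric)
        denom = 1.0 + dot(x, ax)
        for i in range(self.dim):
            for j in range(self.dim):
                a_inv[i][j] -= ax[i] * ax[j] / denom
            self.b[arm][i] += reward * x[i]
```

A round then consists of `arm = bandit.select(x)` followed by `bandit.update(arm, x, reward)`. Maintaining the inverse with a rank-one update avoids storing past observations, which is what keeps memory from growing with T in this sketch.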