Context-lumpable stochastic bandits

Csaba Szepesvári

Abstract

We consider a contextual bandit problem with $S$ contexts and $K$ actions. In each round $t = 1, 2, \dots$, the learner observes a random context and chooses an action based on its past experience. The learner then observes a random reward whose mean is a function of the context and the action for the round. Under the assumption that the contexts can be lumped into $r \le \min\{S, K\}$ groups such that the mean reward for the various actions is the same for any two contexts that are in the same group, we give an algorithm that outputs an $\epsilon$-optimal policy after using at most $\widetilde{O}(r(S+K)/\epsilon^2)$ samples with high probability and provide a matching $\Omega(r(S+K)/\epsilon^2)$ lower bound. In the regret minimization setting, we give an algorithm whose cumulative regret up to time $T$ is bounded by $\widetilde{O}(\sqrt{r^3(S+K)T})$. To the best of our knowledge, we are the first to show the near-optimal sample complexity in the PAC setting and $\widetilde{O}(\sqrt{\mathrm{poly}(r)(S+K)T})$ minimax regret in the online setting for this problem. We also show our algorithms can be applied to more general low-rank bandits and get improved regret bounds in some scenarios.
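To make the lumpability assumption concrete, the following is a minimal sketch of the problem setting only, not of the algorithms the abstract refers to. It builds a table of mean rewards in which every context copies the row of its group, then simulates one round of interaction. The function names, the uniform context distribution, and the Bernoulli reward model are illustrative assumptions that do not come from the paper.

```python
import random


def make_lumpable_means(S, K, r, seed=0):
    """Build an S x K table of mean rewards where contexts share rows by group.

    Assumes r <= min(S, K), as in the abstract. The group means themselves
    are arbitrary random numbers chosen for illustration.
    """
    rng = random.Random(seed)
    # One row of K action means per group (hypothetical values in [0, 1)).
    group_means = [[rng.random() for _ in range(K)] for _ in range(r)]
    # The first r contexts cover every group; the rest are assigned at random.
    group_of = [s if s < r else rng.randrange(r) for s in range(S)]
    # Each context inherits the mean-reward row of its group.
    means = [list(group_means[group_of[s]]) for s in range(S)]
    return means, group_of


def play_round(means, policy, rng):
    """One round: draw a context, let the policy pick an action, draw a reward.

    Uniform contexts and Bernoulli rewards are assumptions of this sketch.
    """
    context = rng.randrange(len(means))
    action = policy(context)
    reward = 1 if rng.random() < means[context][action] else 0
    return context, action, reward


S, K, r = 6, 4, 2
means, group_of = make_lumpable_means(S, K, r)
# Lumpability means the table has at most r distinct rows.
distinct_rows = {tuple(row) for row in means}

rng = random.Random(1)
context, action, reward = play_round(means, lambda c: 0, rng)
```

Because the table has at most $r$ distinct rows, it is determined by the $S$ group labels and the $rK$ group-level means rather than by $SK$ free entries. That is the structure the abstract's $r(S+K)$ scaling reflects.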
