Bandits with Side Observations: Bounded vs. Logarithmic Regret

Abstract

We consider the classical stochastic multi-armed bandit, with the difference that, from time to time and roughly with frequency ε, the agent gathers an extra observation for free. We prove that, no matter how small ε is, the agent can ensure a regret uniformly bounded in time. More precisely, we construct an algorithm with a regret smaller than Σ_i log(1/ε)/Δ_i (Δ_i being the suboptimality gap of arm i), up to multiplicative constants and log log terms. We also prove a matching lower bound, stating that no reasonable algorithm can outperform this quantity.
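
To make the setting concrete, the following Python sketch simulates a bandit in which a free observation arrives with probability ε at each round. It is an illustration only and not the algorithm constructed in the paper: the Bernoulli rewards, the uniformly random choice of the freely observed arm, the greedy pulling rule, and the names `simulate`, `means`, `eps` and `horizon` are all assumptions made for this example.

```python
import random


def simulate(means, eps, horizon, seed=0):
    """Simulate a Bernoulli bandit with occasional free side observations.

    Assumptions (not taken from the paper): rewards are Bernoulli, each free
    observation is drawn from a uniformly random arm, and the agent pulls
    greedily according to empirical means.
    Returns (pseudo_regret, number_of_free_observations).
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k   # samples seen per arm (pulls and free observations)
    sums = [0.0] * k   # cumulated observed rewards per arm
    best = max(means)
    regret = 0.0
    free_obs = 0

    def observe(arm):
        # Draw one Bernoulli sample from the arm and record it.
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward

    def index(arm):
        # Unseen arms come first; otherwise rank by empirical mean.
        if counts[arm] == 0:
            return float("inf")
        return sums[arm] / counts[arm]

    for _ in range(horizon):
        arm = max(range(k), key=index)
        observe(arm)
        # Only pulled arms incur regret; free observations cost nothing.
        regret += best - means[arm]
        if rng.random() < eps:
            observe(rng.randrange(k))
            free_obs += 1

    return regret, free_obs


if __name__ == "__main__":
    print(simulate([0.9, 0.5, 0.4], eps=0.1, horizon=2000, seed=1))
```

In this sketch the free observations are the only source of exploration, so the greedy rule can recover from an unlucky start only when ε > 0. This is meant to convey the intuition behind the setting, not to reproduce the regret bound stated above.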
