On Frequentist Regret of Linear Thompson Sampling

Abstract

This paper studies the stochastic linear bandit problem, where a decision-maker chooses actions from possibly time-dependent sets of vectors in Rd and receives noisy rewards. The objective is to minimize regret, the difference between the cumulative expected reward of the decision-maker and that of an oracle with access to the expected reward of each action, over a sequence of T decisions. Linear Thompson Sampling (LinTS) is a popular Bayesian heuristic, supported by theoretical analysis that shows its Bayesian regret is bounded by O(dT), matching minimax lower bounds. However, previous studies demonstrate that the frequentist regret bound for LinTS is O(ddT), which requires posterior variance inflation and is by a factor of d worse than the best optimism-based algorithms. We prove that this inflation is fundamental and that the frequentist bound of O(ddT) is the best possible, by demonstrating a randomization bias phenomenon in LinTS that can cause linear regret without inflation.We propose a data-driven version of LinTS that adjusts posterior inflation using observed data, which can achieve minimax optimal frequentist regret, under additional conditions. Our analysis provides new insights into LinTS and settles an open problem in the field.

0

Turn this paper into a lesson

ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…