Sliding-Window Thompson Sampling for Non-Stationary Settings

Abstract

Non-stationary multi-armed bandits (NS-MABs) model sequential decision-making problems in which the expected rewards of a set of actions, also known as arms, evolve over time. In this paper, we fill a gap in the literature by providing a novel analysis of Thompson sampling (TS)-inspired algorithms for NS-MABs that both corrects and generalizes existing work. Specifically, we study the cumulative frequentist regret of two sliding-window TS algorithms with different priors, namely Beta-SWTS and γ-SWGTS. We derive a unifying regret upper bound for these algorithms that applies to any NS-MAB with either Bernoulli or subgaussian rewards. Our result introduces new indices that capture the inherent sources of complexity in the learning problem. We then specialize our general result to two of the most common NS-MAB settings, the abruptly changing and the smoothly changing environments, and show that it matches state-of-the-art results. Finally, we evaluate the analyzed algorithms in simulated environments and compare them with state-of-the-art approaches for NS-MABs.
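To make the sliding-window idea concrete, here is a minimal sketch of a Beta-prior sliding-window Thompson sampler for Bernoulli rewards. It is an illustration under stated assumptions, not the paper's exact Beta-SWTS: it assumes a uniform Beta(1, 1) prior, a window that covers the last `window` rounds across all arms, and a toy abruptly changing environment. The names `beta_swts`, `abrupt_env`, `window`, and `change_point` are hypothetical.

```python
import random
from collections import deque


def abrupt_env(arm, t, rng, change_point=500):
    """Toy two-arm Bernoulli environment (assumed for illustration):
    arm 0 is best before `change_point`, arm 1 is best afterwards."""
    best = 0 if t < change_point else 1
    mean = 0.9 if arm == best else 0.1
    return 1 if rng.random() < mean else 0


def beta_swts(reward_fn, n_arms, horizon, window, seed=0):
    """Sliding-window Thompson sampling with Beta(1, 1) priors (sketch).

    Only the rewards observed in the last `window` rounds contribute to
    each arm's posterior, so stale observations are forgotten.
    """
    rng = random.Random(seed)
    history = deque()            # (arm, reward) pairs inside the window
    successes = [0] * n_arms     # windowed count of rewards equal to 1
    failures = [0] * n_arms      # windowed count of rewards equal to 0
    pulls = []

    for t in range(horizon):
        # Draw one posterior sample per arm and play the largest.
        samples = [
            rng.betavariate(1 + successes[a], 1 + failures[a])
            for a in range(n_arms)
        ]
        arm = max(range(n_arms), key=samples.__getitem__)
        reward = reward_fn(arm, t, rng)

        # Record the new observation.
        history.append((arm, reward))
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1

        # Evict the observation that just fell out of the window.
        if len(history) > window:
            old_arm, old_reward = history.popleft()
            if old_reward:
                successes[old_arm] -= 1
            else:
                failures[old_arm] -= 1

        pulls.append(arm)

    return pulls, successes, failures


pulls, successes, failures = beta_swts(abrupt_env, n_arms=2, horizon=1000, window=100)
```

The window length trades off two effects: a short window adapts quickly after a change but estimates each arm from few samples, while a long window gives tighter estimates but keeps outdated rewards in the posterior for longer.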
