Non-Stationary Restless Multi-Armed Bandits with Provable Guarantee
Abstract
Online restless multi-armed bandits (RMABs) typically assume that each arm follows a stationary Markov Decision Process (MDP) with fixed state transitions and rewards. However, in real-world applications like healthcare and recommendation systems, these assumptions often break due to non-stationary dynamics, posing significant challenges for traditional RMAB algorithms. In this work, we specifically consider N-armd RMAB with non-stationary transition constrained by bounded variation budgets B. Our proposed \; algorithm integrates sliding window reinforcement learning (RL) with an upper confidence bound (UCB) mechanism to simultaneously learn transition dynamics and their variations. We further establish that \; achieves O(N2 B14 T34) regret bound by leveraging a relaxed definition of regret, providing a foundational theoretical framework for non-stationary RMAB problems for the first time.
Turn this paper into a full lesson
ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.