Learning in Markovian bandits with non-observable states and constrained decision epochs

Abstract

This paper studies the problem of regret minimization in Markovian bandits with non-observable states and possibly constrained decision epochs. The focus is restricted to a ``pure'' regret benchmark, that compares the performance of the learning algorithm to the best pure policy which -- akin to optimal policies of stochastic bandits -- picks the optimal arm from start to finish without ever switching. We introduce a generalization of rested Markovian bandits, self-degrading Markovian bandits, for which pure policies are always asymptotically optimal.We show that without prior knowledge on the underlying bandit, the regret of algorithms that switch arms rarely necessarily scales super-logarithmically for every bandit, i.e., as ω((T)), where T is the learning horizon. Despite the unreachability of the logarithmic regime, we design UCB-NOM, an optimistic algorithm inspired by UCB, of which the regret is nearly logarithmic. Lastly, we show that given prior knowledge on the Markovian bandit in the form of a bound on the bias functions of its arm, a proper instantiation of UCB-NOM achieves O((T)) regret. We further show that this prior knowledge allows for a O(T (T)) worst-case regret bound for UCB-NOM. Notably, our regret bounds do not depend on the number of states of the underlying Markov chains. Our findings suggest that the non-observability of states is a mild inconvenience in self-degrading Markovian bandits.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…