Learning Adversarial MDPs with Bandit Feedback and Unknown Transition
Abstract
We consider the problem of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves $\tilde{O}(L|X|\sqrt{|A|T})$ regret with high probability, where $L$ is the horizon, $|X|$ is the number of states, $|A|$ is the number of actions, and $T$ is the number of episodes. To the best of our knowledge, our algorithm is the first to ensure $\tilde{O}(\sqrt{T})$ regret in this challenging setting; in fact, it achieves the same regret bound as (Rosenberg & Mansour, 2019a), which considers an easier setting with full-information feedback. Our key technical contributions are twofold: a tighter confidence set for the transition function, and an optimistic loss estimator that is inversely weighted by an upper occupancy bound.
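To make the second contribution concrete, here is a minimal sketch of what a loss estimator "inversely weighted by an upper occupancy bound" could look like for a single episode. It is an illustration under stated assumptions, not the paper's implementation: the function name, the array layout (states by actions), and the extra smoothing parameter `gamma` are all hypothetical and are not given in the abstract. The idea is that under bandit feedback the loss is observed only on visited state-action pairs, so the observed loss is divided by an upper bound on the probability of visiting that pair. Because the divisor is at least the true occupancy, the estimate is biased downward in expectation, which is the sense in which it is optimistic.

```python
import numpy as np


def optimistic_loss_estimate(observed_loss, visited, upper_occupancy, gamma=0.0):
    """Sketch of an optimistic, inversely weighted loss estimate for one episode.

    All names and shapes are assumptions made for illustration:
      observed_loss   -- array (states, actions); only entries where visited == 1 are used
      visited         -- 0/1 array (states, actions); 1 where the learner visited (x, a)
      upper_occupancy -- array (states, actions); upper bound on the visit probability
      gamma           -- hypothetical non-negative smoothing term added to the divisor
    """
    observed_loss = np.asarray(observed_loss, dtype=float)
    visited = np.asarray(visited, dtype=float)
    denominator = np.asarray(upper_occupancy, dtype=float) + gamma

    if np.any(denominator <= 0):
        raise ValueError("upper_occupancy + gamma must be strictly positive")

    # Unvisited pairs get estimate 0; visited pairs get loss / (upper bound + gamma).
    return visited * observed_loss / denominator


# Usage: two states, two actions, one visited pair.
loss = [[0.5, 0.0], [0.0, 0.2]]
visited = [[1, 0], [0, 0]]
upper = [[0.5, 0.5], [0.25, 0.25]]

estimate = optimistic_loss_estimate(loss, visited, upper)
smoothed = optimistic_loss_estimate(loss, visited, upper, gamma=0.5)
```

In this toy input only the pair (0, 0) is visited, so every other entry of the estimate is zero, and increasing `gamma` shrinks the estimate on the visited pair.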