Reward-Mixing MDPs with a Few Latent Contexts are Learnable
Abstract
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs): at the beginning of every episode nature randomly picks a latent reward model among M candidates and an agent interacts with the MDP throughout the episode for H time steps. Our goal is to learn a near-optimal policy that nearly maximizes the H time-step cumulative rewards in such a model. Previous work established an upper bound for RMMDPs for M=2. In this work, we resolve several open questions remained for the RMMDP model. For an arbitrary M2, we provide a sample-efficient algorithm--EM2--that outputs an ε-optimal policy using O (ε-2 · Sd Ad · poly(H, Z)d ) episodes, where S, A are the number of states and actions respectively, H is the time-horizon, Z is the support size of reward distributions and d=(2M-1,H). Our technique is a higher-order extension of the method-of-moments based approach, nevertheless, the design and analysis of the algorithm requires several new ideas beyond existing techniques. We also provide a lower bound of (SA)(M) / ε2 for a general instance of RMMDP, supporting that super-polynomial sample complexity in M is necessary.
Turn this paper into a full lesson
ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.