Near-Optimal Regret for Policy Optimization in Contextual MDPs with General Offline Function Approximation
Abstract
We introduce OPO-CMDP, the first policy optimization algorithm for stochastic Contextual Markov Decision Processes (CMDPs) under general offline function approximation. Our approach achieves a high-probability regret bound of $\widetilde{O}(H^4 \sqrt{T|S||A|\log(|\mathcal{F}||\mathcal{P}|)})$, where $S$ and $A$ denote the state and action spaces, $H$ the horizon length, $T$ the number of episodes, and $\mathcal{F}$, $\mathcal{P}$ the finite function classes used to approximate the losses and dynamics, respectively. This is the first regret bound with optimal dependence on $|S|$ and $|A|$, directly improving the current state-of-the-art (Qian, Hu, and Simchi-Levi, 2024). These results demonstrate that optimistic policy optimization provides a natural, computationally superior, and theoretically near-optimal path for solving CMDPs.