Optimal Uniform OPE and Model-based Offline Reinforcement Learning in Time-Homogeneous, Reward-Free and Task-Agnostic Settings
Abstract
This work studies the statistical limits of uniform convergence for offline policy evaluation (OPE) problems with model-based methods (for episodic MDPs) and provides a unified framework towards optimal learning for several well-motivated offline tasks. Uniform OPE, $\sup_{\pi\in\Pi}|Q^\pi-\hat{Q}^\pi|<\epsilon$, is a stronger measure than point-wise OPE and ensures offline learning when $\Pi$ contains all policies (the global class). In this paper, we establish an $\Omega(H^2 S/d_m\epsilon^2)$ lower bound (over the model-based family) for global uniform OPE, and our main result establishes an upper bound of $O(H^2/d_m\epsilon^2)$ for local uniform convergence, which applies to all near-empirically optimal policies for MDPs with stationary transitions. Here $d_m$ is the minimal marginal state-action probability. Critically, the highlight in achieving the optimal rate $O(H^2/d_m\epsilon^2)$ is our design of the singleton absorbing MDP, a new sharp analysis tool that works with the model-based approach. We generalize this model-based framework to two new settings, offline task-agnostic and offline reward-free learning, with optimal complexity $O(H^2\log(K)/d_m\epsilon^2)$ ($K$ is the number of tasks) and $O(H^2 S/d_m\epsilon^2)$ respectively. These results provide a unified solution for simultaneously solving different offline RL problems.
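To make the model-based plug-in idea concrete, the following is a minimal tabular sketch rather than the authors' implementation. It assumes offline data given as episodes of `(s, a, r, s_next)` tuples, deterministic policies stored as an `H x S` integer array, and a stationary transition kernel, so counts are pooled over all time steps. The function names `fit_empirical_model`, `evaluate_policy`, and `uniform_ope_error` are hypothetical.

```python
import numpy as np

def fit_empirical_model(episodes, S, A):
    """Plug-in estimates of P(s'|s,a) and r(s,a), pooling counts over time steps."""
    counts = np.zeros((S, A, S))
    reward_sum = np.zeros((S, A))
    for episode in episodes:
        for s, a, r, s_next in episode:
            counts[s, a, s_next] += 1
            reward_sum[s, a] += r
    n = counts.sum(axis=2)            # number of visits to each (s, a)
    visited = n > 0                   # unvisited pairs keep all-zero estimates
    P_hat = np.zeros_like(counts)
    P_hat[visited] = counts[visited] / n[visited][:, None]
    r_hat = np.zeros((S, A))
    r_hat[visited] = reward_sum[visited] / n[visited]
    return P_hat, r_hat

def evaluate_policy(P, r, policy, H):
    """Backward induction for Q^pi_h; policy[h, s] is the action taken at step h."""
    S, A = r.shape
    Q = np.zeros((H, S, A))
    V_next = np.zeros(S)
    for h in reversed(range(H)):
        Q[h] = r + P @ V_next                     # (S, A, S) @ (S,) -> (S, A)
        V_next = Q[h, np.arange(S), policy[h]]    # value of following the policy
    return Q

def uniform_ope_error(P, r, P_hat, r_hat, policies, H):
    """Largest sup-norm gap between true and estimated Q over a policy class."""
    return max(
        np.abs(evaluate_policy(P, r, pi, H) - evaluate_policy(P_hat, r_hat, pi, H)).max()
        for pi in policies
    )

# Toy example (assumed data): 2 states, 2 actions, action a moves to state a,
# reward 1 only for (s=1, a=1). One episode visits every (s, a) pair once.
episodes = [[(0, 0, 0.0, 0), (0, 1, 0.0, 1), (1, 1, 1.0, 1), (1, 0, 0.0, 0)]]
P_hat, r_hat = fit_empirical_model(episodes, S=2, A=2)
always_one = np.ones((3, 2), dtype=int)           # take action 1 at every step
Q_hat = evaluate_policy(P_hat, r_hat, always_one, H=3)
```

In this sketch, the uniform OPE quantity from the abstract corresponds to `uniform_ope_error` evaluated over the policy class of interest; here the class is passed as an explicit finite list, which is only practical for very small problems.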