Model-Free Robust Average-Reward Reinforcement Learning with Sample Complexity Analysis

Abstract

Robust reinforcement learning (RL) under the average-reward criterion is essential for long-term decision-making, particularly when the environment may differ from its training dynamics. However, most existing studies focus on model-based settings and provide only asymptotic guarantees, hindering their principled understanding and practical deployment, especially in data-limited scenarios. We aim to close this gap by proposing a model-free algorithm, Robust Halpern Iteration (RHI). We first design our algorithm based on a black-box sampling oracle, which can estimate the worst-case performance accurately. We then derive the finite sample complexity of RHI under the generative model setting, assuming the sampling oracle. To concretely design such an oracle, we propose a K-order multi-level Monte-Carlo estimator, which is shown to have a lower bias compared to prior methods. We further instantiate our design for multiple uncertainty models, including KL and $\chi^2$ divergence sets, and show that our RHI algorithm achieves an $\epsilon$-optimal robust policy with a sample complexity of $O(SAH^2\epsilon^{-(2+o(1))})$, where $S$ and $A$ are the numbers of states and actions, and $H$ is the robust optimal span. Our result asymptotically matches the best complexity in robust average-reward RL.
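To give a feel for the iteration that RHI is named after, here is a minimal sketch of the generic Halpern scheme $x_{k+1} = \beta_k x_0 + (1-\beta_k) T(x_k)$ with $\beta_k = 1/(k+2)$. This is not the RHI algorithm from the paper: the operator `rotate`, the step-size schedule, and the helper names are assumptions chosen purely for illustration. A plane rotation stands in for a nonexpansive operator that is not a contraction, which is the situation where anchoring to $x_0$ helps and plain fixed-point iteration can fail to converge.

```python
import math


def rotate(x, theta=math.pi / 2):
    """Illustrative nonexpansive map on R^2: rotation about the origin.

    Its only fixed point is (0, 0). It preserves distances, so it is
    nonexpansive but not a contraction.
    """
    c, s = math.cos(theta), math.sin(theta)
    return (c * x[0] - s * x[1], s * x[0] + c * x[1])


def halpern(T, x0, num_iters):
    """Generic Halpern iteration (assumed schedule beta_k = 1 / (k + 2)).

    Each step averages the anchor x0 with T(x_k); the weight on the
    anchor shrinks to zero as k grows.
    """
    x = x0
    for k in range(num_iters):
        beta = 1.0 / (k + 2)
        tx = T(x)
        x = tuple(beta * a + (1.0 - beta) * b for a, b in zip(x0, tx))
    return x


def residual(T, x):
    """Fixed-point residual ||T(x) - x|| in the Euclidean norm."""
    tx = T(x)
    return math.hypot(tx[0] - x[0], tx[1] - x[1])


if __name__ == "__main__":
    start = (1.0, 0.0)

    # Anchored iteration: the residual shrinks toward zero.
    x_halpern = halpern(rotate, start, 2000)
    print("Halpern residual:", residual(rotate, x_halpern))

    # Plain iteration x_{k+1} = T(x_k): the point circles the origin forever.
    x_plain = start
    for _ in range(2000):
        x_plain = rotate(x_plain)
    print("Plain residual:  ", residual(rotate, x_plain))
```

In the paper's setting the role of `T` would be played by a robust Bellman-type operator evaluated through the sampling oracle, so each application of `T` is noisy; the sketch above ignores that noise entirely.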
