Efficient Q-Learning and Actor-Critic Methods for Robust Average Reward Reinforcement Learning
Abstract
We present a non-asymptotic convergence analysis of Q-learning and actor-critic algorithms for robust average-reward Markov Decision Processes (MDPs) under contamination, total-variation (TV) distance, and Wasserstein uncertainty sets. A key ingredient of our analysis is showing that the optimal robust Q operator is a strict contraction with respect to a carefully designed semi-norm (with constant functions quotiented out). This property enables a stochastic approximation update that learns the optimal robust Q-function using O(ε^{-2}) samples. We also provide an efficient routine for robust Q-function estimation, which in turn facilitates robust critic estimation. Building on this, we introduce an actor-critic algorithm that learns an ε-optimal robust policy within O(ε^{-2}) samples. We provide numerical simulations to evaluate the performance of our algorithms.
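To make the contraction idea concrete, the following is a minimal sketch and not the paper's construction. It assumes the contamination uncertainty set, for which the worst-case expectation has the well-known closed form (1 - δ) E_P[V] + δ min V, and it uses the plain span semi-norm (max minus min) as the simplest semi-norm that vanishes on constant functions; the paper's "carefully designed" semi-norm may differ. The names `span`, `robust_backup`, `delta`, and the random test MDP are all hypothetical, and the average-reward gain term is omitted because it only shifts the result by a constant.

```python
import numpy as np


def span(x):
    """Span semi-norm: max minus min, which is zero exactly on constant arrays."""
    return float(np.max(x) - np.min(x))


def robust_backup(Q, r, P, delta):
    """One contamination-robust backup (illustrative; gain term omitted).

    Q, r : arrays of shape (S, A)
    P    : nominal transition kernel of shape (S, A, S)
    delta: contamination radius in [0, 1)
    """
    V = Q.max(axis=1)  # greedy state values V(s) = max_a Q(s, a)
    # Worst case over the contamination set: mix the nominal expectation
    # with the adversary's best choice, which puts all mass on argmin V.
    worst = (1.0 - delta) * (P @ V) + delta * V.min()
    return r + worst


# A small random MDP, used purely for illustration.
rng = np.random.default_rng(0)
S, A, delta = 5, 3, 0.2
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)  # normalise rows into probability distributions
r = rng.random((S, A))
Q1 = rng.normal(size=(S, A))
Q2 = rng.normal(size=(S, A))

# Distance between the two backups versus the contracted input distance.
lhs = span(robust_backup(Q1, r, P, delta) - robust_backup(Q2, r, P, delta))
rhs = (1.0 - delta) * span(Q1 - Q2)
```

In this simplified setting, `lhs <= rhs` holds for any pair `Q1`, `Q2`: the `delta * V.min()` term is constant across state-action pairs, so it drops out of the span, leaving a factor of at most 1 - δ. Adding a constant to `Q` shifts the backup by the same constant, which is why a semi-norm that ignores constants is the natural yardstick.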