Sample Complexity of Variance-reduced Distributionally Robust Q-learning

Abstract

Dynamic decision-making under distributional shifts is of fundamental interest in theory and applications of reinforcement learning: The distribution of the environment in which the data is collected can differ from that of the environment in which the model is deployed. This paper presents two novel model-free algorithms, namely the distributionally robust Q-learning and its variance-reduced counterpart, that can effectively learn a robust policy despite distributional shifts. These algorithms are designed to efficiently approximate the q-function of an infinite-horizon γ-discounted robust Markov decision process with Kullback-Leibler ambiguity set to an entry-wise ε-degree of precision. Further, the variance-reduced distributionally robust Q-learning combines the synchronous Q-learning with variance-reduction techniques to enhance its performance. Consequently, we establish that it attains a minimax sample complexity upper bound of O(|S||A|(1-γ)-4ε-2), where S and A denote the state and action spaces. This is the first complexity result that is independent of the ambiguity size δ, thereby providing new complexity theoretic insights. Additionally, a series of numerical experiments confirm the theoretical findings and the efficiency of the algorithms in handling distributional shifts.

0

Turn this paper into a lesson

ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…