Linear Q-Learning Does Not Diverge in L2: Convergence Rates to a Bounded Set

Abstract

Q-learning is one of the most fundamental reinforcement learning algorithms. It is widely believed that Q-learning with linear function approximation (i.e., linear Q-learning) suffers from possible divergence until the recent work Meyn (2024) which establishes the ultimate almost sure boundedness of the iterates of linear Q-learning. Building on this success, this paper further establishes the first L2 convergence rate of linear Q-learning iterates (to a bounded set). Similar to Meyn (2024), we do not make any modification to the original linear Q-learning algorithm, do not make any Bellman completeness assumption, and do not make any near-optimality assumption on the behavior policy. All we need is an ε-softmax behavior policy with an adaptive temperature. The key to our analysis is the general result of stochastic approximations under Markovian noise with fast-changing transition functions. As a side product, we also use this general result to establish the L2 convergence rate of tabular Q-learning with an ε-softmax behavior policy, for which we rely on a novel pseudo-contraction property of the weighted Bellman optimality operator.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…