A Note on Target Q-learning For Solving Finite MDPs with A Generative Oracle
Abstract
Q-learning with function approximation could diverge in the off-policy setting and the target network is a powerful technique to address this issue. In this manuscript, we examine the sample complexity of the associated target Q-learning algorithm in the tabular case with a generative oracle. We point out a misleading claim in [Lee and He, 2020] and establish a tight analysis. In particular, we demonstrate that the sample complexity of the target Q-learning algorithm in [Lee and He, 2020] is $O(|S|^2 |A|^2 (1-\gamma)^{-5} \varepsilon^{-2})$. Furthermore, we show that this sample complexity is improved to $O(|S| |A| (1-\gamma)^{-5} \varepsilon^{-2})$ if we can sequentially update all state-action pairs and $O(|S| |A| (1-\gamma)^{-4} \varepsilon^{-2})$ if $\gamma$ is further in $(1/2, 1)$. Compared with the vanilla Q-learning, our results conclude that the introduction of a periodically-frozen target Q-function does not sacrifice the sample complexity.
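For orientation, the following is a minimal sketch of tabular target Q-learning with a generative oracle, in the variant that sweeps all state-action pairs sequentially. It is an illustration under stated assumptions, not the algorithm or parameter schedule analyzed in the paper: the function name `target_q_learning`, the constant step size, the loop counts, and the toy MDP are all hypothetical choices made here.

```python
import numpy as np


def target_q_learning(P, R, gamma, outer_iters, inner_iters, step_size, seed=0):
    """Tabular target Q-learning sketch (hypothetical helper, not from the paper).

    P: array of shape (S, A, S), transition probabilities used as the generative oracle.
    R: array of shape (S, A), deterministic rewards.
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    q = np.zeros((n_states, n_actions))
    for _ in range(outer_iters):
        # Freeze a copy of the current estimate; it stays fixed for the whole inner loop.
        q_target = q.copy()
        for _ in range(inner_iters):
            # Sequential sweep: every state-action pair is updated once per pass.
            for s in range(n_states):
                for a in range(n_actions):
                    # Generative oracle: draw one next state from P(. | s, a).
                    s_next = rng.choice(n_states, p=P[s, a])
                    # Bootstrap from the frozen target, not from the live estimate q.
                    td_target = R[s, a] + gamma * q_target[s_next].max()
                    q[s, a] += step_size * (td_target - q[s, a])
    return q


# Toy 2-state, 2-action MDP (assumed for illustration): action 0 moves to state 0,
# action 1 moves to state 1, regardless of the current state.
P = np.zeros((2, 2, 2))
P[:, 0, 0] = 1.0
P[:, 1, 1] = 1.0
R = np.array([[0.0, 1.0], [0.0, 2.0]])

q_hat = target_q_learning(P, R, gamma=0.5, outer_iters=30, inner_iters=50, step_size=0.5)
```

In this sketch each outer iteration approximately applies one Bellman optimality backup to the frozen target, so the outer loop behaves like value iteration while the inner loop averages out sampling noise. Each pass of the sweep uses one oracle call per state-action pair.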