Variance-reduced Q-learning is minimax optimal

Abstract

We introduce and analyze a form of variance-reduced Q-learning. For γ-discounted MDPs with finite state space X and action space U, we prove that it yields an ε-accurate estimate of the optimal Q-function in the ∞-norm using $\mathcal{O}\left( \frac{D}{\epsilon^2 (1-\gamma)^3} \log\left( \frac{D}{1-\gamma} \right) \right)$ samples, where D = |X| × |U|. This guarantee matches known minimax lower bounds up to a logarithmic factor in the discount complexity 1/(1-γ). In contrast, our past work shows that ordinary Q-learning has worst-case quartic scaling in the discount complexity.
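
To get a feel for how the stated sample complexity scales, the following minimal sketch evaluates the expression D / (ε² (1-γ)³) · log(D / (1-γ)) numerically. The function name `vr_q_learning_bound` is hypothetical, and the leading constant hidden by the O-notation is set to 1 purely as an assumption, so only ratios between values are meaningful, not the absolute numbers.

```python
import math


def vr_q_learning_bound(num_pairs: int, eps: float, gamma: float) -> float:
    """Evaluate D / (eps^2 (1-gamma)^3) * log(D / (1-gamma)).

    num_pairs is D = |X| * |U|. The constant hidden by the O-notation
    is taken to be 1 (an assumption made only for illustration).
    """
    if num_pairs < 1:
        raise ValueError("num_pairs must be at least 1")
    if eps <= 0.0:
        raise ValueError("eps must be positive")
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")

    # Discount complexity 1 / (1 - gamma).
    discount_complexity = 1.0 / (1.0 - gamma)
    polynomial_part = num_pairs * discount_complexity**3 / eps**2
    log_part = math.log(num_pairs * discount_complexity)
    return polynomial_part * log_part


def scaling_ratio(num_pairs: int, eps: float, gamma_lo: float, gamma_hi: float) -> float:
    """Ratio of the bound at gamma_hi to the bound at gamma_lo."""
    return vr_q_learning_bound(num_pairs, eps, gamma_hi) / vr_q_learning_bound(
        num_pairs, eps, gamma_lo
    )
```

For example, `scaling_ratio(10, 0.1, 0.9, 0.99)` compares the expression at discount complexity 100 against discount complexity 10: the cubic term contributes a factor of 1000 and the logarithmic term a further modest factor, which illustrates why the cubic dependence dominates the guarantee.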
