Policy Optimization with Stochastic Mirror Descent

Abstract

Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes VRMPO, a sample-efficient policy gradient method based on stochastic mirror descent. VRMPO uses a novel variance-reduced policy gradient estimator to improve sample efficiency. We prove that VRMPO needs only O(ε^{-3}) sample trajectories to reach an ε-approximate first-order stationary point, which matches the best known sample complexity for policy optimization. Extensive experiments demonstrate that VRMPO outperforms state-of-the-art policy gradient methods in various settings.
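
For readers unfamiliar with mirror descent, the following is a minimal sketch of a single mirror ascent step on a probability simplex, using the negative-entropy mirror map (which yields a multiplicative, exponentiated-gradient update). It is a generic illustration only, not the VRMPO algorithm or its variance-reduced estimator, neither of which is specified in the abstract. The function name `mirror_ascent_step`, the three-action toy problem, and the step size are all assumptions chosen for illustration.

```python
import math


def mirror_ascent_step(policy, grad_estimate, step_size):
    """One mirror ascent step on the probability simplex.

    Assumes the negative-entropy mirror map, so the update is
    p_i <- p_i * exp(step_size * g_i), followed by normalization.
    `grad_estimate` may be an exact or a stochastic gradient of the
    objective with respect to the action probabilities.
    """
    # Subtracting the maximum leaves the normalized result unchanged
    # and avoids overflow in exp().
    max_g = max(grad_estimate)
    weights = [
        p * math.exp(step_size * (g - max_g))
        for p, g in zip(policy, grad_estimate)
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical toy problem: a stateless task with three actions whose
# expected rewards are known, so J(p) = sum_i p_i * r_i and grad J = r.
rewards = [0.1, 0.5, 0.9]
policy = [1.0 / 3, 1.0 / 3, 1.0 / 3]

for _ in range(50):
    policy = mirror_ascent_step(policy, rewards, step_size=0.5)
```

In this toy setting the exact gradient is used. In a stochastic variant, `grad_estimate` would be replaced by an estimate computed from sampled trajectories, and the variance of that estimate is what a variance-reduced estimator aims to control.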
