Last-Iterate Convergence of General Parameterized Policies in Constrained MDPs

Abstract

This paper focuses on learning a Constrained Markov Decision Process (CMDP) via general parameterized policies. We propose a Primal-Dual based Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm that uses entropy and quadratic regularizers to reach this goal. For parameterized policy classes with a transferred compatibility approximation error, $\epsilon_{\mathrm{bias}}$, PDR-ANPG achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation with a sample complexity of $O(\epsilon^{-2}\min\{\epsilon^{-2}, \epsilon_{\mathrm{bias}}^{-1/3}\})$. If the class is incomplete ($\epsilon_{\mathrm{bias}} > 0$), then the sample complexity reduces to $O(\epsilon^{-2})$ for $\epsilon < (\epsilon_{\mathrm{bias}})^{1/6}$. Moreover, for complete policies with $\epsilon_{\mathrm{bias}} = 0$, our algorithm achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation with $O(\epsilon^{-4})$ sample complexity. It is a significant improvement over the state-of-the-art last-iterate guarantees of general parameterized CMDPs.
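
The two special cases quoted in the abstract can be read off the general bound by a short case analysis. The sketch below is an illustration rather than the paper's own derivation: the symbol $N(\epsilon)$ is introduced here only as shorthand for the stated sample complexity, $\epsilon_{\mathrm{bias}}$ is treated as a fixed constant in the incomplete case, and $\epsilon_{\mathrm{bias}}^{-1/3}$ is interpreted as $+\infty$ when $\epsilon_{\mathrm{bias}} = 0$.

```latex
% Shorthand (an assumption of this sketch, not notation from the paper):
%   N(\epsilon) denotes the stated sample complexity, up to constants.
\[
  N(\epsilon) \;=\; \epsilon^{-2}\,
  \min\bigl\{\epsilon^{-2},\; \epsilon_{\mathrm{bias}}^{-1/3}\bigr\}
\]

% Case 1: incomplete class, \epsilon_{\mathrm{bias}} > 0 and
% \epsilon < \epsilon_{\mathrm{bias}}^{1/6}.
\[
  \epsilon < \epsilon_{\mathrm{bias}}^{1/6}
  \;\Longrightarrow\;
  \epsilon^{2} < \epsilon_{\mathrm{bias}}^{1/3}
  \;\Longrightarrow\;
  \epsilon^{-2} > \epsilon_{\mathrm{bias}}^{-1/3},
\]
% so the minimum is attained by the bias term:
\[
  N(\epsilon) \;=\; \epsilon_{\mathrm{bias}}^{-1/3}\,\epsilon^{-2}
  \;=\; O\bigl(\epsilon^{-2}\bigr)
  \quad \text{for fixed } \epsilon_{\mathrm{bias}} > 0 .
\]

% Case 2: complete class, \epsilon_{\mathrm{bias}} = 0, reading
% \epsilon_{\mathrm{bias}}^{-1/3} as +\infty.
\[
  \min\bigl\{\epsilon^{-2},\, +\infty\bigr\} = \epsilon^{-2}
  \;\Longrightarrow\;
  N(\epsilon) \;=\; \epsilon^{-2}\cdot\epsilon^{-2} \;=\; \epsilon^{-4}.
\]
```

Under this reading, the threshold $\epsilon = \epsilon_{\mathrm{bias}}^{1/6}$ is exactly the point at which the two arguments of the minimum coincide, which is why the $O(\epsilon^{-2})$ rate applies only below it.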
