Revisit Policy Optimization in Matrix Form
Abstract
In the tabular case, when the reward and environment dynamics are known, policy evaluation can be written as $V_\pi = (I - \gamma P_\pi)^{-1} r_\pi$, where $P_\pi$ is the state transition matrix under policy $\pi$ and $r_\pi$ is the reward signal under $\pi$. The inconvenience is that both $P_\pi$ and $r_\pi$ are entangled with $\pi$, so they change every time $\pi$ is updated. In this paper, we leverage the notation from [wang2007dual] to disentangle $\pi$ from the environment dynamics, which makes optimization over the policy more straightforward. We show that the policy gradient theorem [sutton2018reinforcement] and TRPO [schulman2015trust] can be put into a more general framework, and that this notation has good potential to be extended to model-based reinforcement learning.
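The closed-form evaluation above can be illustrated with a minimal NumPy sketch. It assumes one common way of separating the policy from the dynamics: a policy matrix `Pi` of shape (S, S·A) such that `P_pi = Pi @ P` and `r_pi = Pi @ r`, where `P` and `r` are indexed by state-action pairs. This layout, the helper names `policy_matrix` and `evaluate`, and the toy MDP numbers are illustrative assumptions, not details taken from the abstract.

```python
import numpy as np

def policy_matrix(pi):
    """Build Pi of shape (S, S*A) from pi of shape (S, A).

    Row s holds pi(.|s) in the block of columns belonging to state s
    and zeros elsewhere (assumed layout: column index = s * A + a).
    """
    S, A = pi.shape
    Pi = np.zeros((S, S * A))
    for s in range(S):
        Pi[s, s * A:(s + 1) * A] = pi[s]
    return Pi

def evaluate(pi, P, r, gamma):
    """Solve V_pi = (I - gamma * P_pi)^{-1} r_pi without forming the inverse."""
    S = pi.shape[0]
    Pi = policy_matrix(pi)
    P_pi = Pi @ P   # (S, S) state-to-state transitions under pi
    r_pi = Pi @ r   # (S,) expected one-step reward under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

# Hypothetical toy MDP with 2 states and 2 actions.
# Rows of P are state-action pairs (s0,a0), (s0,a1), (s1,a0), (s1,a1).
P = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.7, 0.3],
    [0.1, 0.9],
])
r = np.array([1.0, 0.0, 0.0, 2.0])  # reward per state-action pair
gamma = 0.9

pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])         # uniform random policy
V = evaluate(pi, P, r, gamma)
print(V)
```

In this sketch `P` and `r` stay fixed while only `Pi` changes when the policy is updated, which is the separation the abstract describes.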