Persistent-Transient Policy Evaluation for Markov Chains via Minimal Peripheral Quotients

Vaneet Aggarwal

Abstract

We study fixed-policy evaluation for finite Markov chains that may be reducible and periodic. Classical evaluation methods with gain and bias decomposition are not always diagnostic: the gain records only invariant Cesàro averages, while persistent phase-dependent behavior is absorbed into the bias together with genuinely transient effects. We identify the real peripheral invariant subspace K(P) of the transition matrix P as the source of this ambiguity. Quotienting by K(P) is the minimal exact quotient that removes all non-decaying modes and makes the remaining dynamics strictly stable. After choosing a gauge projection with kernel K(P), the reward admits a unique decomposition r = g + (I-P)v, where g is a persistent regime profile and v is a gauge-fixed transient component. An exact comparison with classical normalized gain and bias shows that the new pair reallocates the same information so that all persistent modes are represented in g and v is transient. This decomposition reconstructs finite-horizon returns, recovers statewise average reward, admits a transient-cost interpretation, and yields a stable estimator under a generative model.
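The decomposition r = g + (I-P)v can be illustrated on a small example. The sketch below is not taken from the paper: it assumes P is diagonalizable and fixes one particular gauge, the spectral one, in which v lies in the P-invariant complement of K(P). The function name `persistent_transient`, the tolerance `tol`, and the three-state chain (one transient state feeding a two-state periodic cycle) are all hypothetical choices made for illustration.

```python
import numpy as np

def persistent_transient(P, r, tol=1e-9):
    """Split r into g + (I - P) v using the spectral gauge.

    Assumptions (not from the paper): P is diagonalizable, and the gauge
    is chosen so that v lies in the invariant complement of K(P).
    """
    n = P.shape[0]
    w, V = np.linalg.eig(P)             # eigenvalues and right eigenvectors
    W = np.linalg.inv(V)                # rows are the matching left eigenvectors
    peripheral = np.abs(w) > 1.0 - tol  # modes with |lambda| = 1 do not decay
    # Spectral projection onto K(P); real because complex modes come in pairs.
    E = (V[:, peripheral] @ W[peripheral, :]).real
    g = E @ r                           # persistent part, lies in K(P)
    # I - P + E is invertible and agrees with I - P on the stable complement.
    v = np.linalg.solve(np.eye(n) - P + E, (np.eye(n) - E) @ r)
    return g, v

# State 0 is transient; states 1 and 2 form a period-2 cycle.
P = np.array([[0.5, 0.25, 0.25],
              [0.0, 0.0,  1.0],
              [0.0, 1.0,  0.0]])
r = np.array([3.0, 2.0, 0.0])

g, v = persistent_transient(P, r)
print("g =", np.round(g, 6))  # persistent regime profile
print("v =", np.round(v, 6))  # transient component
print("residual =", np.abs(g + (np.eye(3) - P) @ v - r).max())
```

In this example the classical gain is the constant vector (1, 1, 1), because the cycle averages the rewards 2 and 0. The profile g = (1, 2, 0) keeps the phase-dependent rewards on the cycle, while v = (4, 0, 0) is nonzero only at the transient state.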
