Beyond the Bellman Fixed Point: Geometry and Fast Policy Identification in Value Iteration
Abstract
Q-value iteration (Q-VI) is usually analyzed through the \(\gamma\)-contraction of the Bellman operator. This argument proves convergence to \(Q^*\), but it gives only a coarse account of when the induced greedy policy becomes optimal. We study discounted Q-VI as a switching system and focus on the practically optimal solution set (POSS), the set of \(Q\)-functions whose tie-broken greedy policies are optimal. The main result shows that Q-VI reaches the optimal action class in finite time by entering an invariant tube around \(\mathcal{X}_1 = Q^* + \operatorname{span}(\mathbf{1})\), which is contained in the POSS. For every \(\varepsilon > 0\), the distance to \(\mathcal{X}_1\) satisfies an exponential bound with rate \((\rho + \varepsilon)^k\), where \(\rho\) is the joint spectral radius of the projected switching family restricted to directions transverse to \(\mathcal{X}_1\). When \(\rho < \gamma\), this transverse convergence is faster than the classical contraction rate. The analysis separates fast policy identification from the subsequent convergence to \(Q^*\), which may still be governed by the all-ones mode. We also give spectral and graph-theoretic conditions under which the strict inequality \(\rho < \gamma\) holds or fails.
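The gap between policy identification and value convergence can be seen on a very small example. The sketch below is illustrative only: the two-state deterministic MDP, the function names (`bellman_update`, `greedy`, `tube_dist`, `run`), the tolerance, and the choice of the sup norm for measuring distance to the line \(Q^* + \operatorname{span}(\mathbf{1})\) are all assumptions made for this illustration and are not taken from the paper. It runs synchronous Q-VI from zero, records the first iteration after which the tie-broken greedy policy stays optimal, and compares it with the first iteration at which the sup-norm error to \(Q^*\) falls below the tolerance.

```python
# Toy 2-state, 2-action deterministic MDP (hypothetical, not from the paper).
GAMMA = 0.9
# NEXT[s][a] is the successor state; action 0 = stay, action 1 = switch.
NEXT = [[0, 1], [1, 0]]
# REWARD[s][a]: only staying in state 1 yields reward.
REWARD = [[0.0, 0.0], [1.0, 0.0]]


def bellman_update(q):
    """One synchronous Q-VI sweep: Q(s,a) <- r(s,a) + gamma * max_a' Q(s',a')."""
    return [[REWARD[s][a] + GAMMA * max(q[NEXT[s][a]]) for a in range(2)]
            for s in range(2)]


def greedy(q):
    """Greedy policy, ties broken toward the lowest action index."""
    return [row.index(max(row)) for row in q]


def sup_dist(q, q_ref):
    """Sup-norm distance between two Q-tables."""
    return max(abs(q[s][a] - q_ref[s][a]) for s in range(2) for a in range(2))


def tube_dist(q, q_ref):
    """Sup-norm distance from q to the line q_ref + span(1).

    Minimizing max|d - c| over the scalar c gives (max d - min d) / 2.
    """
    diffs = [q[s][a] - q_ref[s][a] for s in range(2) for a in range(2)]
    return (max(diffs) - min(diffs)) / 2.0


def run(num_iters=400, tol=1e-6):
    # Approximate Q* numerically by iterating far past the horizon studied.
    q_star = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(5000):
        q_star = bellman_update(q_star)
    pi_star = greedy(q_star)

    q = [[0.0, 0.0], [0.0, 0.0]]
    last_wrong = -1      # last iteration whose greedy policy is suboptimal
    first_close = None   # first iteration with sup-norm error below tol
    tube = []            # distance to Q* + span(1) at each iteration
    for k in range(num_iters + 1):
        tube.append(tube_dist(q, q_star))
        if greedy(q) != pi_star:
            last_wrong = k
        if first_close is None and sup_dist(q, q_star) < tol:
            first_close = k
        q = bellman_update(q)
    return pi_star, last_wrong + 1, first_close, tube


if __name__ == "__main__":
    pi_star, policy_k, conv_k, tube = run()
    print("optimal policy:", pi_star)
    print("greedy policy optimal from iteration:", policy_k)
    print("sup-norm error below tolerance from iteration:", conv_k)
    print("distance to Q* + span(1) at k = 1, 2:", tube[1], tube[2])
```

In this toy instance the greedy policy is optimal from the second iteration onward, while the sup-norm error to \(Q^*\) decays only at the rate \(\gamma^k\) and needs well over a hundred iterations to pass the tolerance. From the second iteration the remaining error lies, up to floating-point rounding, along the all-ones direction, so the distance to \(Q^* + \operatorname{span}(\mathbf{1})\) is essentially zero. That exact collapse is a property of this hand-picked example, not of the general bound stated in the abstract.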