Dynamic Mode Decomposition along Depth in Vision Transformers
Abstract
Recent work has shown that contiguous vision transformer (ViT) blocks (a) can be replaced by a linear map and (b) organize into recurrent phases of computation. We ask whether these observations coincide: does ViT depth implement approximately autonomous linear dynamics, admitting a single operator K applied recurrently across a contiguous span? We test this using Dynamic Mode Decomposition (DMD), which fits K from selected, consecutive hidden-state pairs and predicts p steps ahead via K^p. On four pretrained DINO ViTs, we study the regularization, rank, and calibration budget required for stable fitting. For short spans (p ≤ 4), K^p tracks an unconstrained endpoint map to within 0.02 cosine similarity on DINOv3-H/16+, while also recovering intermediate activations at each skipped block. At early cut starts, the fitted operators compress to rank d with minimal calibration data, and across tokens, the CLS token is most amenable to linearization; both properties decay monotonically with depth. Yet this local fidelity does not transfer downstream. At the final hidden state, after propagating through the remaining blocks, an identity baseline becomes competitive.
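The fitting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`fit_dmd_operator`, `predict_ahead`, `mean_cosine`), the ridge regularizer, the row-vector convention, and the synthetic data are all assumptions chosen for illustration. The sketch fits one operator K from pooled consecutive state pairs and then predicts p steps ahead with the p-th matrix power of K.

```python
import numpy as np

def fit_dmd_operator(X, Y, ridge=1e-8):
    """Ridge least-squares fit of K such that Y ≈ X @ K (rows are samples).

    Assumption: plain ridge regression stands in for whatever
    regularization the paper actually uses.
    """
    d = X.shape[1]
    gram = X.T @ X + ridge * np.eye(d)
    return np.linalg.solve(gram, X.T @ Y)

def predict_ahead(K, X, p):
    """Apply the same operator p times: X @ K^p."""
    return X @ np.linalg.matrix_power(K, p)

def mean_cosine(A, B):
    """Mean row-wise cosine similarity between two (n, d) arrays."""
    num = np.sum(A * B, axis=1)
    den = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    return float(np.mean(num / den))

# Synthetic stand-in for hidden states: exactly linear, autonomous dynamics.
rng = np.random.default_rng(0)
d, n, depth = 8, 200, 4
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
K_true = 0.9 * Q                      # hypothetical ground-truth operator

H = [rng.standard_normal((n, d))]     # H[l]: states after "block" l
for _ in range(depth):
    H.append(H[-1] @ K_true)

# Pool consecutive pairs (H[l], H[l+1]) across the span and fit one K.
X = np.concatenate(H[:-1], axis=0)
Y = np.concatenate(H[1:], axis=0)
K_hat = fit_dmd_operator(X, Y)

# Skip all four blocks at once using the fitted operator.
H4_pred = predict_ahead(K_hat, H[0], depth)
print("mean cosine at the endpoint:", mean_cosine(H4_pred, H[depth]))
```

Because the toy data are exactly linear, the fit recovers the operator almost perfectly; with real ViT activations the interesting quantity is how far the cosine similarity falls short of 1 as p and the starting depth grow.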