Dynamic Mode Decomposition along Depth in Vision Transformers

Abstract

Recent work has shown that contiguous vision transformer (ViT) blocks (a) can be replaced by a linear map and (b) organize into recurrent phases of computation. We ask whether these observations coincide: does ViT depth implement approximately autonomous linear dynamics, admitting a single operator K applied recurrently across a contiguous span? We test this using Dynamic Mode Decomposition (DMD), which fits K from selected, consecutive hidden-state pairs and predicts p steps ahead via K^p. On four pretrained DINO ViTs, we study the regularization, rank, and calibration budget required for stable fitting. For short spans (p ≤ 4), K^p tracks an unconstrained endpoint map to within 0.02 cosine similarity on DINOv3-H/16+, while also recovering intermediate activations at each skipped block. At early cut starts, the fitted operators compress to rank d with minimal calibration data, and across tokens, cls is most amenable to linearization; both properties decay monotonically with depth. Yet this local fidelity does not transfer downstream. At the final hidden state, after propagating through the remaining blocks, an identity baseline becomes competitive.
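To make the fitting step concrete, the following is a minimal sketch of how a single operator K could be estimated from consecutive hidden-state pairs and applied p times. It is not the authors' implementation: the abstract does not specify the estimator, so the ridge penalty, the SVD rank truncation, the row-vector convention (h_next ≈ h K), and the helper names `fit_dmd_operator`, `predict_steps`, and `mean_cosine` are all assumptions made for illustration. The data are synthetic states generated by a known linear map, not ViT activations.

```python
import numpy as np

def fit_dmd_operator(X, Y, ridge=1e-3, rank=None):
    """Least-squares fit of K with Y ~= X @ K (row-vector convention).

    X, Y: arrays of shape (n_pairs, d) holding hidden states at block l
    and block l+1. `ridge` and `rank` are illustrative regularizers,
    assumed here rather than taken from the paper.
    """
    d = X.shape[1]
    gram = X.T @ X + ridge * np.eye(d)
    K = np.linalg.solve(gram, X.T @ Y)
    if rank is not None and rank < d:
        # Keep only the top `rank` singular directions of the operator.
        U, s, Vt = np.linalg.svd(K, full_matrices=False)
        K = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return K

def predict_steps(h, K, p):
    """Predict p blocks ahead by applying the same operator p times."""
    return h @ np.linalg.matrix_power(K, p)

def mean_cosine(a, b):
    """Mean row-wise cosine similarity between two (n, d) arrays."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return float(np.mean(num / den))

# Synthetic stand-in for hidden states: an exactly linear, autonomous system.
rng = np.random.default_rng(0)
d, n, depth = 16, 512, 5
A_true = np.eye(d) + 0.1 * rng.standard_normal((d, d)) / np.sqrt(d)
states = [rng.standard_normal((n, d))]
for _ in range(depth - 1):
    states.append(states[-1] @ A_true)

# Pool every consecutive pair in the span into one regression problem.
X = np.concatenate(states[:-1])
Y = np.concatenate(states[1:])
K = fit_dmd_operator(X, Y, ridge=1e-6)

p = depth - 1
cos_dmd = mean_cosine(predict_steps(states[0], K, p), states[p])
cos_identity = mean_cosine(states[0], states[p])  # "do nothing" baseline
print(f"K^p cosine: {cos_dmd:.4f}  identity cosine: {cos_identity:.4f}")
```

Because the synthetic system is exactly linear and autonomous, K^p should match the true endpoint almost perfectly here; the question the abstract raises is how far real ViT activations depart from this idealized case, and how an identity baseline compares once the remaining blocks are applied.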
