The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry
Abstract
We present the first systematic study of weight matrix singular value spectra during transformer pretraining, tracking full SVD decompositions of every weight matrix at 25-step intervals across three model scales (30M--285M parameters). We discover three phenomena. (1) Transient Compression Waves: stable rank compression propagates as a traveling wave from early to late layers, creating a steep depth gradient that peaks early in training and then reverses, with late layers eventually compressing further than early layers. (2) Persistent Spectral Gradients: the power-law exponent α develops a permanent depth gradient that forms a non-monotonic inverted-U in deeper models, with the peak shifting toward earlier layers as depth increases. (3) Q/K--V Functional Asymmetry: value/output projections compress uniformly, while query/key projections carry the full depth-dependent dynamics. The dissociation between transient compression and persistent spectral shape reveals that rank and spectral shape encode fundamentally different information about training. We formalize this as a two-timescale dynamical model and derive scaling laws (α ∝ L^{0.26}, R^2 = 0.99). We validate on nine models across three families (custom, GPT-2, Pythia; 30M--1B parameters; 8--36 layers), demonstrate that α predicts layer importance (correlation 0.69--0.84, p<0.02), and show that spectral-guided pruning outperforms Last-N heuristics by 1.1×--3.6× across seven models in two families (GPT-2 124M--774M, Pythia 160M--1B), with worst-vs-best gaps of up to 23.7×, confirming the causal role of spectral structure.
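The two spectral quantities named above can be illustrated with a minimal sketch. It assumes the standard definition of stable rank (squared Frobenius norm over squared spectral norm) and, as a stand-in for the paper's estimator, fits α as the negated least-squares slope of log singular value against log rank index. The abstract does not state how α is estimated, so the fitting procedure, the function names, and the synthetic spectrum are all illustrative assumptions. The singular values are assumed to have been computed already by an SVD routine.

```python
import math

def stable_rank(singular_values):
    """Stable rank ||W||_F^2 / ||W||_2^2, computed from singular values."""
    squares = [s * s for s in singular_values]
    return sum(squares) / max(squares)

def power_law_exponent(singular_values):
    """Assumed estimator: fit sigma_i ~ i^(-alpha) by least squares in log-log space.

    This is one simple choice, not necessarily the paper's method.
    Requires at least two positive singular values.
    """
    ordered = sorted((s for s in singular_values if s > 0), reverse=True)
    xs = [math.log(i) for i in range(1, len(ordered) + 1)]
    ys = [math.log(s) for s in ordered]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    return -cov / var  # negate the slope so a decaying spectrum gives alpha > 0

# Hypothetical spectrum with an exact power-law decay sigma_i = i^(-0.5).
sigma = [i ** -0.5 for i in range(1, 65)]
alpha = power_law_exponent(sigma)
sr = stable_rank(sigma)
print(f"alpha = {alpha:.3f}, stable rank = {sr:.3f}")
```

Under these assumptions, tracking both numbers per layer and per checkpoint is what would separate the transient signal (stable rank) from the persistent one (α): a flat spectrum has stable rank equal to the number of singular values, while a spectrum dominated by one direction has stable rank close to 1.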