Spectral Asymptotics of Neural Network Loss Landscapes: An Exact Decomposition of the Curvature Exponent

Abstract

The curvature exponent $\alpha$ in $h_k \propto \sigma_k^{\alpha}$ -- governing how Hessian eigenvalues scale with gradient singular values -- varies systematically across layer types ($\alpha \approx 2$ for convolutions, $\approx 1$ for transformer attention, $< 1$ for MLP up-projections). Why? We prove the Spectral Alignment Decomposition: $\alpha = 2 + d\Phi_k / d\sigma_k$, where $\Phi_k$ measures alignment between Kronecker factor eigenbases and gradient singular directions. This reduces "why does $\alpha$ vary?" to a geometric question we answer for LayerNorm, residual connections, and softmax heads. The decomposition implies a spectral transfer identity $s = \alpha\gamma$ linking curvature exponent, effective gradient rank-decay $\gamma$, and Hessian decay exponent $s$. The identity is algebraic; its empirical content is that $\alpha$ and $\gamma$, fit on independent data (HVPs vs. SVD), recover $s$ to ~2% median error across 93 layers, five architectures, and three datasets -- with no free parameters. A zeta-function bound on participation ratio shows curvature concentrates onto effectively one direction per layer. As a proof of concept, we derive the architecture-adaptive preconditioner $T(\sigma; \alpha)$ and show that Spectral Newton -- implementing $T$ in the gradient singular basis -- outperforms AdamW on vision benchmarks where $\alpha \approx 2$.
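
The spectral transfer identity can be illustrated with a minimal sketch on synthetic spectra. It assumes pure power laws (gradient singular values decaying as k^(-γ) and Hessian eigenvalues scaling as σ_k^α), in which case the Hessian spectrum decays as k^(-αγ). The helper `fit_exponent`, the constant 3.0, and the values of `gamma_true` and `alpha_true` are hypothetical choices for illustration, not the paper's fitting procedure or data.

```python
import numpy as np


def fit_exponent(x, y):
    """Return the least-squares slope of log(y) against log(x)."""
    slope, _intercept = np.polyfit(np.log(x), np.log(y), 1)
    return slope


# Synthetic spectra (illustrative values, not taken from the paper).
k = np.arange(1, 201, dtype=float)   # spectral index k = 1..200
gamma_true, alpha_true = 0.8, 2.0    # assumed decay and curvature exponents
sigma = k ** (-gamma_true)           # gradient singular values, k^(-gamma)
h = 3.0 * sigma ** alpha_true        # Hessian eigenvalues, 3 * sigma^alpha

alpha = fit_exponent(sigma, h)       # curvature exponent from (sigma_k, h_k)
gamma = -fit_exponent(k, sigma)      # gradient rank-decay exponent
s = -fit_exponent(k, h)              # Hessian decay exponent

print(f"alpha={alpha:.3f} gamma={gamma:.3f} s={s:.3f} alpha*gamma={alpha * gamma:.3f}")
```

On exact power laws the three fits agree by construction, which mirrors the abstract's remark that the identity is algebraic. The empirical question is whether independently measured α and γ still reproduce s on real layers, which this sketch does not test.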
