Conservation Law Breaking at the Edge of Stability: A Spectral Theory of Non-Convex Neural Network Optimization

Abstract

Why does gradient descent reliably find good solutions in non-convex neural network optimization, despite the landscape being NP-hard in the worst case? We show that gradient flow on L-layer ReLU networks without bias preserves L-1 conservation laws Cl = ||Wl+1||F2 - ||Wl||F2, confining trajectories to lower-dimensional manifolds. Under discrete gradient descent, these laws break with total drift scaling as etaalpha where alpha is approximately 1.1-1.6 depending on architecture, loss function, and width. We decompose this drift exactly as eta2 * S(eta), where the gradient imbalance sum S(eta) admits a closed-form spectral crossover formula with mode coefficients ck proportional to ek(0)2 * lambdax,k2, derived from first principles and validated for both linear (R=0.85) and ReLU (R>0.80) networks. For cross-entropy loss, softmax probability concentration drives exponential Hessian spectral compression with timescale tau = Theta(1/eta) independent of training set size, explaining why cross-entropy self-regularizes the drift exponent near alpha=1.0. We identify two dynamical regimes separated by a width-dependent transition: a perturbative sub-Edge-of-Stability regime where the spectral formula applies, and a non-perturbative regime with extensive mode coupling. All predictions are validated across 23 experiments.

0

Turn this paper into a lesson

ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…