On the Optimizer Dependence of Neural Scaling Laws

Abstract

The scaling exponent α in neural scaling laws of the form L(N) ∝ N^(-α) is commonly treated as a fixed constant set by architecture and data. We present evidence that α depends systematically on the optimizer. In controlled random-feature regression experiments (the canonical theoretical framework for neural scaling), we measure α across five optimizer variants and six spectral conditions. Preconditioned optimizers consistently yield steeper scaling (larger α): the α-shift increases across most of the tested spectral range, peaks near s = 1.5, and remains large at s = 2.0. At s ≈ 1.0 (characteristic of natural language), the full natural gradient achieves α ≈ 0.31 versus α ≈ 0.12 for gradient descent, a fitted exponent roughly 2.6× larger whose effect, within the random-feature model, compounds with each model-size doubling. Whether and how this exponent shift transfers to large-scale LLM training, where recent evidence suggests the advantage may attenuate with scale, remains an important open question. Our results imply that scaling-law forecasts should account for optimizer choice, and we provide a spectral diagnostic predicting when advanced optimizers will pay off.
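
To make the compounding claim concrete, the following is a minimal sketch, not the authors' fitting procedure. It assumes the pure power law L(N) = C · N^(-α) with no irreducible-loss term, estimates α by ordinary least squares in log-log space, and computes the factor 2^(-α) by which the loss shrinks per model-size doubling. The function names, the constant C = 3.0, and the synthetic data are hypothetical and chosen only for illustration; only the two exponents 0.31 and 0.12 come from the abstract.

```python
import math


def fit_scaling_exponent(model_sizes, losses):
    """Fit L(N) = C * N**(-alpha) by least squares on (log N, log L).

    Assumes a pure power law with no additive loss floor.
    Returns (alpha, C).
    """
    xs = [math.log(n) for n in model_sizes]
    ys = [math.log(loss) for loss in losses]
    count = len(xs)
    x_mean = sum(xs) / count
    y_mean = sum(ys) / count
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # The slope in log-log space is -alpha; the intercept is log C.
    return -slope, math.exp(intercept)


def loss_factor_per_doubling(alpha):
    """Multiplicative change in loss when N doubles: (2N)^(-alpha) / N^(-alpha)."""
    return 2.0 ** (-alpha)


# Synthetic, noise-free data generated from the assumed power law
# (hypothetical constant C = 3.0, exponent taken from the abstract).
sizes = [2.0 ** k for k in range(6, 14)]
synthetic_losses = [3.0 * n ** (-0.31) for n in sizes]
alpha_hat, c_hat = fit_scaling_exponent(sizes, synthetic_losses)

# Per-doubling loss factors for the two exponents quoted in the abstract.
factor_natural_gradient = loss_factor_per_doubling(0.31)
factor_gradient_descent = loss_factor_per_doubling(0.12)

print(f"recovered alpha = {alpha_hat:.3f}, C = {c_hat:.3f}")
print(f"loss factor per doubling, alpha = 0.31: {factor_natural_gradient:.3f}")
print(f"loss factor per doubling, alpha = 0.12: {factor_gradient_descent:.3f}")
```

Under these assumptions, each doubling of N multiplies the loss by about 0.81 for α = 0.31 and by about 0.92 for α = 0.12, so the gap between the two curves widens geometrically with the number of doublings. Real loss curves usually include a loss floor and measurement noise, in which case a plain log-log fit of this kind would bias the estimate of α.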
