Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks
Abstract
We study the information-theoretic limits of learning a one-hidden-layer teacher network with hierarchical features from noisy queries, in the context of knowledge transfer to a smaller student model. We work in the high-dimensional regime where the teacher width k scales linearly with the input dimension d -- a setting that captures large-but-finite-width networks and has only recently become analytically tractable. Using a heuristic leave-one-out decoupling argument, validated numerically throughout, we derive asymptotically sharp characterizations of the Bayes-optimal generalization error and individual feature overlaps via a system of closed fixed-point equations. These equations reveal that feature learnability is governed by a sequence of sharp phase transitions: as data grows, teacher features become recoverable sequentially, each through a discontinuous jump in overlap. This sequential acquisition underlies a precise notion of effective width kc -- the number of learnable features at a given data budget n -- which unifies two distinct scaling regimes: a feature-learning regime in which the Bayes-optimal generalization error BO scales as n1/(2β)-1, and a refinement regime in which it scales as n-1, where β>1/2 is the exponent of the power-law feature hierarchy. Both laws collapse to the single relation BO=(kc d/n). We further show empirically that a student trained with Adam near the effective width kc achieves these optimal scaling laws (up to a small algorithmic gap), and provide an information-theoretic account of the associated scaling in model size.
Turn this paper into a full lesson
ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.