Spectral Condition for μP under Width-Depth Scaling
Abstract
Generative foundation models are increasingly scaled in both width and depth, posing significant challenges for stable feature learning and reliable hyperparameter (HP) transfer across model sizes. While maximal update parameterization (μP) has provided a principled solution to both problems for width scaling, existing extensions to the joint width-depth scaling regime remain fragmented, architecture- and optimizer-specific, and often rely on technically involved theories. In this work, we develop a simple and unified spectral framework for μP under joint width-depth scaling. For deep residual networks whose residual blocks contain k transformations, the framework specifies how the norms of weights and their per-step updates should scale with width and depth. It reveals a fundamental transition from k = 1 to k ≥ 2, unifying previously disparate μP formulations and identifying the k ≥ 2 case as more appropriate for practical architectures with multi-transformation branches such as Transformers. Building on this framework, we derive a general recipe for implementing μP across a broad class of optimizers by mapping spectral constraints to concrete HP parameterizations, recovering existing results and extending them to additional optimizers. Finally, experiments on GPT-2-style language models show that the μP formulation derived from the k ≥ 2 case achieves stable feature learning and robust HP transfer under width-depth scaling, whereas standard parameterization and μP in the k = 1 case often fail to do so. These results support the practical effectiveness of the proposed spectral framework.