On the Residual Scaling of Looped Transformers: Stability and Transferability
Abstract
Looped (weight-tied) Transformers apply a shared residual block N times ($h \leftarrow h + \alpha\, f(h)$, with the same $f$ at each step), increasing effective depth without adding parameters. Prior depth-scaling analyses prescribe $\alpha = 1/\sqrt{L}$ for depth-L residual networks. We show that this is insufficient for looped architectures: weight sharing makes residual updates correlated across iterations, requiring the stronger scaling $\alpha = 1/N$. For multi-layer blocks (L unique layers looped N times), we derive a factored parameterization $\alpha = \lambda/(N\sqrt{L})$ that separates the two sources of growth: $1/N$ controls the within-layer loop correlation, and $1/\sqrt{L}$ controls the across-layer variance. A key consequence is that the optimal learning rate depends only on the number of unique layers L, not on the loop count N, enabling direct hyperparameter transfer from small to large N without retuning. Experiments on looped Transformers confirm that $1/N$ scaling improves trainability and yields better loss than $1/\sqrt{N}$ scaling across loop counts.
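The correlation argument can be illustrated with a toy numerical sketch. It is not the paper's experimental setup: it assumes a purely linear block f(h) = W h with a random Gaussian W, and the names `run_residual`, `norm_growth`, and `alpha` (the residual scale) are hypothetical. It compares how much the hidden-state norm grows when the same W is reused at every step (looped) versus when a fresh W is drawn at each step (untied).

```python
import numpy as np


def run_residual(h0, weights, alpha):
    """Apply h <- h + alpha * (W @ h) once for each W in `weights`."""
    h = h0.copy()
    for W in weights:
        h = h + alpha * (W @ h)
    return h


def norm_growth(d=128, n_loops=64, seed=0):
    """Return ||h_N|| / ||h_0|| for three toy configurations.

    Assumption: the block is linear, f(h) = W h, with i.i.d. Gaussian
    entries of variance 1/d. This is a stand-in for a real Transformer block.
    """
    rng = np.random.default_rng(seed)
    h0 = rng.standard_normal(d)
    h0 /= np.linalg.norm(h0)  # unit-norm input, so output norms are growth factors

    def sample_block():
        return rng.standard_normal((d, d)) / np.sqrt(d)

    shared = sample_block()
    tied = [shared] * n_loops                             # same weights every step
    untied = [sample_block() for _ in range(n_loops)]     # fresh weights every step

    inv_sqrt = 1.0 / np.sqrt(n_loops)
    inv = 1.0 / n_loops
    return {
        "untied_sqrt": np.linalg.norm(run_residual(h0, untied, inv_sqrt)),
        "tied_sqrt": np.linalg.norm(run_residual(h0, tied, inv_sqrt)),
        "tied_inv": np.linalg.norm(run_residual(h0, tied, inv)),
    }


if __name__ == "__main__":
    for name, value in norm_growth().items():
        print(f"{name}: {value:.3g}")
```

In this toy setting, the square-root scale keeps the norm of order one when the weights are independent across steps, but lets it grow by orders of magnitude once the weights are tied. The 1/N scale keeps the tied case of order one as well. Nonlinear blocks and trained weights may behave differently, so this only illustrates the mechanism.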