On the Residual Scaling of Looped Transformers: Stability and Transferability

Jian Li

Abstract

Looped (weight-tied) Transformers apply a shared residual block $N$ times ($h \leftarrow h + \alpha\,f(h)$, with the same $f$ at each step), increasing effective depth without adding parameters. Prior depth-scaling analyses prescribe $\alpha = 1/\sqrt{L}$ for depth-$L$ residual networks. We show that this is insufficient for looped architectures: weight sharing makes residual updates correlated across iterations, requiring the stronger scaling $\alpha = 1/N$. For multi-layer blocks ($L$ unique layers looped $N$ times), we derive a factored parameterization $\alpha = \lambda/(N\sqrt{L})$ that separates the two sources of growth: $1/N$ controls the within-layer loop correlation, and $1/\sqrt{L}$ controls the across-layer variance. A key consequence is that the optimal learning rate depends only on the number of unique layers $L$, not on the loop count $N$, enabling direct hyperparameter transfer from small to large $N$ without retuning. Experiments on looped Transformers confirm that $1/N$ scaling improves trainability and yields better loss than $1/\sqrt{N}$ scaling across loop counts.
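The correlation argument can be illustrated with a toy linear model. The sketch below is not the paper's experiment: it assumes a residual scale written as alpha, a single-layer block f(h) = W h with a random Gaussian W, and hypothetical helper names (`final_norm`, `factored_alpha`). It compares how the hidden-state norm behaves after N residual steps when W is shared across steps versus redrawn at every step.

```python
import numpy as np


def factored_alpha(n_loops, n_layers, lam=1.0):
    """Residual scale lam / (N * sqrt(L)), the factored form stated in the abstract."""
    return lam / (n_loops * np.sqrt(n_layers))


def final_norm(n_steps, alpha, shared, dim=256, seed=0):
    """Run h <- h + alpha * W h for n_steps; return ||h_N|| / ||h_0||.

    Toy assumption: f is linear with Gaussian weights of variance 1/dim.
    shared=True reuses one W at every step (looped, weight-tied);
    shared=False draws a fresh W per step (ordinary depth-N residual stack).
    """
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(dim)
    h0_norm = np.linalg.norm(h)
    W = rng.standard_normal((dim, dim)) / np.sqrt(dim)
    for _ in range(n_steps):
        if not shared:
            W = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        h = h + alpha * (W @ h)
    return np.linalg.norm(h) / h0_norm


if __name__ == "__main__":
    for n in (16, 64, 256):
        inv_sqrt = 1.0 / np.sqrt(n)
        print(
            f"N={n:4d}  "
            f"independent, 1/sqrt(N): {final_norm(n, inv_sqrt, shared=False):8.2f}  "
            f"shared, 1/sqrt(N): {final_norm(n, inv_sqrt, shared=True):12.2f}  "
            f"shared, 1/N: {final_norm(n, 1.0 / n, shared=True):6.2f}"
        )
```

In this toy setting, independent weights with alpha = 1/sqrt(N) keep the norm ratio of order one, shared weights with the same alpha let it grow with N, and shared weights with alpha = 1/N keep it bounded again. A nonlinear Transformer block will behave differently in detail, so this should be read only as intuition for why correlated updates call for the stronger scaling.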
