How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models
Abstract
We measure how much one recurrence is worth to a looped (depth-recurrent) transformer, in equivalent unique parameters. From an iso-depth pretraining sweep across recurrence counts $r \in \{1, 2, 4, 8\}$ spanning 50× in training compute, we fit a joint scaling law $L = E + A\,(N_{\text{once}} + r^{\varphi} N_{\text{rec}})^{-\alpha} + B\,D^{-\beta}$ and measure a recurrence-equivalence exponent $\varphi = 0.46$. Intuitively, $\varphi$ tells us whether looping a block $r$ times is equivalent in validation loss to $r$ unique blocks of a non-looped model (full equivalence, $\varphi = 1$) or to a single block run repeatedly with no capacity gain ($\varphi = 0$). Our $\varphi = 0.46$ sits in between, so replacing unique blocks with shared recurrences increases validation loss at matched training compute. For example, at $r = 4$ a 410M looped model performs on par with a 580M non-looped model, but incurs the training cost of a 1B non-looped one. We demonstrate the utility of $\varphi$ as a diagnostic tool on two case studies. Commonly used truncated backpropagation lowers $\varphi$ to 0.38, indicating that the loop mechanism is poorly trained under truncation, even though validation loss decreases. Conversely, hyperconnections raise $\varphi$ to 0.65, a genuine capacity gain. Our method separates true loop improvements from training-side gains, a distinction raw validation loss cannot make.
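The parameter term of the scaling law can be illustrated with a minimal Python sketch. It assumes the recurrence-equivalence exponent (written `phi` here) acts on the recurrence count, so that the effective parameter count is the once-used parameters plus `r ** phi` times the recurrent-block parameters; this reading matches the two limits described in the abstract. The function names, the 200M/200M parameter split, and the interpretation of D as training tokens are illustrative assumptions, not values or definitions taken from the paper, and no fitted coefficients are supplied.

```python
def effective_params(n_once, n_rec, r, phi):
    """Effective unique-parameter count of a looped model.

    n_once: parameters used once per forward pass (assumed meaning of N_once)
    n_rec:  parameters in the shared block that is looped (assumed meaning of N_rec)
    r:      number of recurrences
    phi:    recurrence-equivalence exponent (0 = no gain, 1 = full equivalence)
    """
    return n_once + (r ** phi) * n_rec


def predicted_loss(n_eff, d_tokens, E, A, alpha, B, beta):
    """Joint scaling law L = E + A * N_eff^(-alpha) + B * D^(-beta).

    All coefficients must be supplied by the caller; none are given here.
    d_tokens is assumed to be the number of training tokens.
    """
    return E + A * n_eff ** (-alpha) + B * d_tokens ** (-beta)


if __name__ == "__main__":
    # Hypothetical split, chosen only for illustration (not from the paper).
    n_once, n_rec, r = 200e6, 200e6, 4
    for phi in (0.0, 0.46, 1.0):
        n_eff = effective_params(n_once, n_rec, r, phi)
        print(f"phi={phi:.2f}: effective parameters = {n_eff / 1e6:.0f}M")
```

At `phi = 0` the looped block counts once regardless of `r`, and at `phi = 1` it counts as `r` unique blocks; intermediate values such as the measured 0.46 fall between these two limits.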