Finite-Size Gradient Transport in Large Language Model Pretraining: From Cascade Size to Intensive Transport Efficiency
Abstract
We introduce a finite-size gradient-transport framework for real language-model training, based on five observables (D,z,β,δ,vrel) that separate cascade size, duration, absolute transport, and intensive transport efficiency. We analyze direct raw-gradient measurements from Pico-LM across four scales and 125 aligned steps, together with a five-scale Pythia companion dataset built from 153 aligned checkpoint-difference update fields. The same algebraic closure holds in both families, and both share a near-unity cascade-size backbone, but they occupy distinct transport regimes: Pico-LM shows positive duration scaling and negative intensive-efficiency scaling, whereas Pythia remains near the D=1 baseline with only weak positive efficiency scale dependence. Randomized-field controls give nearly matched null floors in the intensive and duration channels, indicating that the contrast reflects different real departures from a shared null skeleton rather than different null calibrations. The families also differ in stepwise power-law compressibility: Pico-LM retains clean duration and efficiency power laws, whereas Pythia preserves the size backbone but shows weaker one-slope compressibility in those channels. External performance associations are correspondingly channel-level, carried mainly by vrel and normalized cascade duration, while D(t) acts as a shared size backbone without a significant exponent-level performance association. These results support a reusable transport measurement framework without claiming a universal fixed point or a first-principles derivation of neural scaling laws.
Turn this paper into a full lesson
ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.