Dimensional Criticality at Grokking Across MLPs and Transformers
Abstract
Abrupt transitions between distinct dynamical regimes are a hallmark of complex systems. Grokking in deep neural networks provides a striking example: an abrupt transition from memorization to generalization long after training accuracy saturates. Yet robust macroscopic signatures of this transition remain elusive. Here we introduce TDU–OFC (Thresholded Diffusion Update–Olami-Feder-Christensen), an offline avalanche probe that converts gradient snapshots into cascade statistics and extracts a macroscopic observable, the time-resolved effective cascade dimension D(t), via grokking-aligned finite-size scaling. Across Transformers trained on modular addition and MLPs trained on XOR, we discover a localized dynamical crossing of the Gaussian diffusion baseline D=1 precisely at the generalization transition. The crossing direction is task-dependent: modular addition descends through D=1 (approaching from D>1), while XOR ascends (from D<1). This opposite-direction convergence is consistent with attraction toward a candidate shared critical manifold, rather than trivial residence near D ≈ 1. Negative controls confirm this picture: ungrokked runs remain supercritical (D>1) and never enter the post-transition regime. In addition, avalanche distributions exhibit heavy tails and finite-size scaling consistent with the dimensional exponent extracted from D(t). Shadow-probe controls (α_train = 0) confirm that D(t) is non-invasive, and grokked trajectories diverge from ungrokked ones in D(t) some 100–200 epochs before the behavioral transition.