First-Passage Prediction of Grokking Delay: A Calibrated Law under AdamW with Causal Validation

Phan Thanh Duc

Abstract

We give the first quantitative prediction of grokking delay under AdamW. Treating the delay as a first-passage time, we derive a closed-form law $T_{\mathrm{grok}} - T_{\mathrm{mem}} = \frac{1}{2\kappa_{\mathrm{LL}}\eta\lambda}\log(V_{\mathrm{mem}}/V_\star)$, where $V_t = \|\theta_t\|^2$ is the squared parameter norm, $V_\star$ is an architecture-dependent threshold, and $\kappa_{\mathrm{LL}}$ absorbs the AdamW correction to the clean-SGD contraction rate $2\eta\lambda$. Calibrating $(\kappa_{\mathrm{LL}}, V_\star)$ on a single hyperparameter cell predicts grokking delays on 26 held-out runs with MAPE 17.7% over a 41x delay range; the law generalises to MLPs (MAPE 18.0%, N=34) and degrades to 23.3% on cross-task extension (N=46, 43.5x range), with a structured residual in which $V_\star / V_{\mathrm{mem}}$ stays comparatively stable within architecture (CV about 14% on the 1L transformer). First-passage of $V_t$ is necessary but not sufficient. A quantile-margin theorem establishes that positive delay requires both norm separation $V_{\mathrm{mem}} > V_{\mathrm{post}}$ and angular reachability of a threshold $\alpha_\star = \arcsin\bigl(C / V_{T_{\mathrm{mem}}}^{1/2}\bigr)$, where $C$ is computable from the empirical NTK feature map and the validation-margin quantile. Calibrating $C$ on modulus $p=89$ predicts $\alpha_\star = 47.2$ degrees at $p=97$ (observed 47.8 degrees, error 1.3%) as a prior cross-cell prediction. Causal interventions that freeze the norm or remove weight decay at memorisation eliminate grokking (0/6 vs. 3/3 baseline), trapping the angular displacement near 12 degrees. $\kappa_{\mathrm{LL}}$ is empirically measured per architecture rather than derived from $(\beta_1, \beta_2, \epsilon)$; within-architecture CV stays at most 15% across four architectures, but values differ by about 2x between architectural variants beyond depth alone. Empirical scope is algorithmic tasks (modular arithmetic, sparse parity) under AdamW; whether the law transfers to natural-language scale models is open.
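
To make the two closed-form expressions in the abstract concrete, the following is a minimal Python sketch that evaluates them for given inputs. The function names, argument names, and the numeric values in the usage lines are hypothetical and chosen only for illustration; they are not the paper's calibrated values. The sketch reads the delay law as 1 / (2 * kappa_LL * eta * lambda) times log(V_mem / V_star), which is the first-passage time of a squared norm that decays exponentially at rate 2 * kappa_LL * eta * lambda per step. It also assumes the angular threshold is returned in degrees, matching the units quoted in the abstract.

```python
import math


def predicted_delay(eta, lam, kappa_ll, v_mem, v_star):
    """Predicted number of steps T_grok - T_mem under the first-passage law.

    eta: learning rate, lam: weight decay, kappa_ll: empirically calibrated
    AdamW correction factor, v_mem: squared parameter norm at memorisation,
    v_star: calibrated threshold for the squared norm.
    """
    # Assumption of this sketch: the law is only evaluated when the norm
    # still has to shrink to reach the threshold, i.e. v_mem > v_star > 0.
    if not (v_mem > v_star > 0.0):
        raise ValueError("expected v_mem > v_star > 0")
    rate = 2.0 * kappa_ll * eta * lam  # effective contraction rate of V_t
    return math.log(v_mem / v_star) / rate


def angular_threshold_deg(c, v_at_mem):
    """Angular threshold alpha_star = arcsin(C / sqrt(V)) in degrees.

    c: constant calibrated from data (treated here as a plain input),
    v_at_mem: squared parameter norm at the memorisation time.
    """
    ratio = c / math.sqrt(v_at_mem)
    # arcsin is only defined on [-1, 1]; a ratio above 1 means the
    # threshold is not reachable for these inputs.
    if not (0.0 <= ratio <= 1.0):
        raise ValueError("C / sqrt(V) must lie in [0, 1]")
    return math.degrees(math.asin(ratio))


# Hypothetical inputs, for illustration only.
delay = predicted_delay(eta=1e-3, lam=1.0, kappa_ll=1.0,
                        v_mem=math.exp(2.0), v_star=1.0)
alpha = angular_threshold_deg(c=1.0, v_at_mem=4.0)
print(f"predicted delay: {delay:.1f} steps, alpha_star: {alpha:.1f} degrees")
```

Because the delay depends on the norms only through log(V_mem / V_star), it scales inversely with the product of learning rate and weight decay: halving either one doubles the predicted delay for a fixed norm ratio.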
