Late-Stage Generalization Collapse in Grokking: Detecting anti-grokking with Weightwatcher
Abstract
Memorization in neural networks lacks a precise operational definition and is often inferred from the grokking regime, where training accuracy saturates while test accuracy remains very low. We identify a previously unreported third phase of grokking in this training regime: anti-grokking, a late-stage collapse of generalization. We revisit two canonical grokking setups: a 3-layer MLP trained on a subset of MNIST and a transformer trained on modular addition, but extended training far beyond standard. In both cases, after models transition from pre-grokking to successful generalization, test accuracy collapses back to chance while training accuracy remains perfect, indicating a distinct post-generalization failure mode. To diagnose anti-grokking, we use the open-source WeightWatcher tool based on HTSR/SETOL theory. The primary signal is the emergence of Correlation Traps: anomalously large eigenvalues beyond the Marchenko--Pastur bulk in the empirical spectral density of shuffled weight matrices, which are predicted to impair generalization. As a secondary signal, anti-grokking corresponds to the average HTSR layer quality metric α deviating from 2.0. Neither metric requires access to the test or training data. We compare these signals to alternative grokking diagnostic, including 2 norms, Activation Sparsity, Absolute Weight Entropy, and Local Circuit Complexity. These track pre-grokking and grokking but fail to identify anti-grokking. Finally, we show that Correlation Traps can induce catastrophic forgetting and/or prototype memorization, and observe similar pathologies in large-scale LLMs, like OSS GPT 20/120B.
Turn this paper into a lesson
ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.