Learning to Forget Attention: Memory Consolidation for Adaptive Compute Reduction
Abstract
Hybrid architectures combining state-space models with attention have achieved strong efficiency-quality tradeoffs, yet existing approaches either apply attention uniformly or learn static sparse patterns. This misses a key opportunity: attention demand should decrease over time as recurring patterns become familiar. We present a surprising finding from analyzing GPT-2 models: 88% of attention operations retrieve information already predictable from the model's hidden state, and this redundancy does not decrease during training. Motivated by this observation, we introduce Consolidation-based Routing for Adaptive Memory, a biologically inspired memory consolidation mechanism that gradually distills episodic retrievals into parametric semantic memory. Unlike prior sparse attention methods, our method exhibits decreasing attention utilization over training, achieving a 37.8× reduction through a sharp phase transition at approximately 3K steps. We prove that this capability is impossible without consolidation: any static routing scheme requires Ω(f · n) attention for tasks with recurring patterns of frequency f. On our proposed SRCD benchmark, our method achieves 100% retrieval accuracy at 1.6% attention compute (vs. 68% for baselines), and consolidated patterns transfer to unseen tasks with 48–52% attention reduction without retraining. Remarkably, the learned consolidation dynamics quantitatively match human episodic-to-semantic memory transition curves from cognitive psychology (γ = 0.43 vs. γ_human ≈ 0.4–0.5). Code and benchmarks are available at [anonymized].