Attention is Just Another Name for Coupling?: A Fast-Slow ODE Perspective on Hierarchical Pretraining
Abstract
Causal self-attention is a coupling mechanism: each token's hidden state is updated by a learned mixture of preceding tokens at the same timescale. This paper asks whether a second, temporally slower coupling-a slow sub-system operating on a temporally-downsampled view of the sequence and fed back into the fast path through a zero-initialised gate-complements it. The question is framed in the language of singularly perturbed ordinary differential equations (ODEs), where the fast variable x evolves at the token rate, the slow variable y evolves at one update per P tokens, and the timescale ratio = 1/P is enforced structurally by causal block-mean pooling. The paper instantiates the fast-slow ODE formalism as a concrete neural network: a fast path of standard causal attention over T tokens, a slow path of full attention over T/P pooled tokens (P2 × cheaper per layer), and a zero-initialised additive gate. In addition, under a linear-generator assumption on the fast dynamics, we prove that the equilibrium manifold x = ϕ(y) is exactly the master-equation (ME) stationary distribution pst(y); in that regime a learned MLP ϕθ(y) is a variational approximation of it (the trained block is not a generator, so this identity is the structured limit, not a claim about the network as trained). Empirically, at 500k tokens the coupling is neutral -- the gate stays closed and the coupled and frozen ablations are within run-to-run noise -- at a wall-clock cost comparable to a dense baseline. The contribution is the precise, gap-marked mapping itself, not a performance gain.
Turn this paper into a lesson
ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.