Coupled Query-Key Dynamics for Attention

Abstract

Standard scaled dot-product attention computes scores from static, independent projections of the input. We show that evolving queries and keys jointly through shared learned dynamics before scoring, which we call coupled QK dynamics, improves language modeling perplexity and training stability. On WikiText-103 at 60M parameters, coupled dynamics achieves a perplexity of 22.55 to 22.62 vs. 24.22 for standard attention (6.6% to 6.9% lower), with only 0.11% additional parameters (shared across both instantiations). A structural ablation isolates coupling as the active ingredient: a symplectic (Hamiltonian) and a non-symplectic (Euler) integrator perform identically when both couple Q and K, while an uncoupled MLP baseline of matched capacity reaches only 23.81 with 8× higher seed variance. The number of integration steps (1 to 7) is similarly irrelevant: a single coupled step suffices. A compute-matched comparison reveals that coupling is a sample-efficiency mechanism: standard attention trained for 2.4× longer (matching wall-clock time) reaches the same perplexity, but requires 2.4× more tokens. The advantage scales to 150M parameters (-6.7%) but narrows at 350M (-1.0%), where Differential Attention (18.93) overtakes coupled dynamics (19.35). The benefit is corpus-dependent: coupling helps on domain-coherent text (WikiText-103 -6.6%, PubMed -4.5%) but degrades perplexity on heterogeneous web text (+10.3%) and shows no benefit on GLUE. We characterize when coupling helps and when it does not, providing practical guidelines.
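To make the idea of a single coupled step concrete, here is a minimal sketch in plain Python. The abstract does not give the form of the dynamics, so everything specific here is an assumption for illustration: the function names, the shared matrix `W`, the `tanh` nonlinearity, the step size `dt`, and the choice to couple each query with the key at the same position. It shows only the general pattern of one explicit Euler update in which the query's change depends on the key and the key's change depends on the query, followed by ordinary scaled dot-product scoring.

```python
import math

def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def coupled_euler_step(Q, K, W, dt=0.1):
    # Hypothetical coupled update (not the paper's stated equations):
    #   q <- q + dt * tanh(W k),  k <- k + dt * tanh(W q)
    # Both updates read the old q and k (explicit Euler) and share W.
    Q_new, K_new = [], []
    for q, k in zip(Q, K):
        dq = [math.tanh(v) for v in matvec(W, k)]  # query moves as a function of the key
        dk = [math.tanh(v) for v in matvec(W, q)]  # key moves as a function of the query
        Q_new.append([a + dt * b for a, b in zip(q, dq)])
        K_new.append([a + dt * b for a, b in zip(k, dk)])
    return Q_new, K_new

def attention_weights(Q, K):
    # Standard scaled dot-product scores followed by a row-wise softmax.
    d = len(Q[0])
    rows = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

# Toy inputs: two positions, dimension 2 (values chosen arbitrarily).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[0.5, 0.5], [1.0, -1.0]]
W = [[0.0, 1.0], [-1.0, 0.0]]

Q1, K1 = coupled_euler_step(Q, K, W, dt=0.1)
weights = attention_weights(Q1, K1)
```

Setting `dt=0.0` leaves `Q` and `K` unchanged, so the sketch reduces to standard attention in that limit. An uncoupled variant, in the spirit of the MLP baseline the abstract mentions, would instead compute `dq` from `q` alone and `dk` from `k` alone.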
