Exact Attention Sensitivity and the Geometry of Transformer Stability

Abstract

Despite powering modern AI, transformers remain mysteriously brittle to train. We develop a stability theory that explains why pre-LayerNorm works, why DeepNorm uses $N^{-1/4}$ scaling, and why warmup is necessary, all from first principles. Our framework has two pillars: (1) We derive the exact operator norm of the softmax Jacobian, $\|J_{\mathrm{softmax}}(u/\tau)\|_{\infty\to 1} = \theta(p)/\tau$, where the balanced-mass factor $\theta(p)\in[0,1]$ quantifies attention sensitivity. (2) We introduce a block-$\infty$/RMS geometry aligned with tokenwise computation, yielding Lipschitz bounds independent of sequence length. Using this framework, we prove that pre-LN preserves identity gradient paths while post-LN compounds LayerNorm Jacobians exponentially with depth, and we show that DeepNorm's $N^{-1/4}$ emerges from the quartic structure of attention's four projection matrices. We validate our theory on 774M-parameter models and find that, contrary to the intuition that attention sharpens during training to reduce sensitivity, $\theta(p) \approx 1$ persists throughout. Transformer stability arises entirely from architectural gradient flow, not from attention dynamics. This finding changes how we reason about training: the architecture itself must handle sensitivity, not learned attention patterns.
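The identity in pillar (1) can be checked numerically on small inputs. The sketch below is illustrative and is not the authors' code. It assumes the balanced-mass factor has the closed form $\theta(p) = 4\max_S m_S(1-m_S)$ with $m_S = \sum_{i\in S} p_i$. The abstract does not state this definition, but it is what a direct computation of the $\infty\to 1$ norm of $(\mathrm{diag}(p) - pp^\top)/\tau$ yields. All function names are hypothetical, and both searches are brute force, so this is only practical for a handful of tokens.

```python
import math
from itertools import product


def softmax(u, tau=1.0):
    """Numerically stable softmax of u / tau."""
    scaled = [x / tau for x in u]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(u, tau=1.0):
    """Jacobian of u -> softmax(u / tau), i.e. (diag(p) - p p^T) / tau."""
    p = softmax(u, tau)
    n = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) / tau for j in range(n)]
            for i in range(n)]


def inf_to_one_norm(J):
    """max of ||J x||_1 over ||x||_inf <= 1.

    The objective is convex in x, so the maximum is attained at a vertex
    of the cube; we enumerate all sign vectors (2^n of them).
    """
    n = len(J)
    best = 0.0
    for signs in product((-1.0, 1.0), repeat=n):
        value = sum(abs(sum(J[i][j] * signs[j] for j in range(n)))
                    for i in range(n))
        best = max(best, value)
    return best


def theta(p):
    """Assumed closed form: 4 * max over subsets S of m_S * (1 - m_S)."""
    n = len(p)
    best = 0.0
    for mask in range(1 << n):
        mass = sum(p[i] for i in range(n) if (mask >> i) & 1)
        best = max(best, 4.0 * mass * (1.0 - mass))
    return best


if __name__ == "__main__":
    u, tau = [2.0, 0.5, -1.0, 0.0], 0.7  # arbitrary example logits and temperature
    p = softmax(u, tau)
    print("operator norm :", inf_to_one_norm(softmax_jacobian(u, tau)))
    print("theta(p) / tau:", theta(p) / tau)
```

Under this assumed definition, $\theta(p) = 1$ whenever some subset of tokens carries exactly half of the attention mass, and $\theta(p) = 0$ for a one-hot distribution, which matches the stated range $[0,1]$.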
