Exact Attention Sensitivity and the Geometry of Transformer Stability

Seyed Morteza Emadi

Abstract

Despite powering modern AI, transformers remain mysteriously brittle to train. We develop a stability theory that explains why pre-LayerNorm works, why DeepNorm uses $N^{-1/4}$ scaling, and why warmup is necessary, all from first principles. Our framework has two pillars: (1) We derive the exact operator norm of the softmax Jacobian, $\|J_{\mathrm{softmax}}(u/\tau)\|_{\infty\to 1} = \theta(p)/\tau$, where the balanced-mass factor $\theta(p)\in[0,1]$ quantifies attention sensitivity. (2) We introduce a block-$\infty$/RMS geometry aligned with tokenwise computation, yielding Lipschitz bounds independent of sequence length. Using this framework, we prove that pre-LN preserves identity gradient paths while post-LN compounds LayerNorm Jacobians exponentially with depth, and we show that DeepNorm's $N^{-1/4}$ emerges from the quartic structure of attention's four projection matrices. We validate our theory on 774M-parameter models and find that, contrary to the intuition that attention sharpens during training to reduce sensitivity, $\theta(p) \approx 1$ persists throughout. Transformer stability arises entirely from architectural gradient flow, not from attention dynamics. This finding changes how we reason about training: the architecture itself must handle sensitivity, not learned attention patterns.
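The first pillar can be checked numerically on small inputs. The sketch below is an illustration under stated assumptions, not the paper's code: it reads the norm as the operator norm from ℓ∞ to ℓ1, and it assumes the closed form θ(p) = 4 · max over subsets S of p(S)(1 − p(S)), which the abstract does not state but which follows from a direct calculation on diag(p) − ppᵀ. The function names `softmax_jacobian`, `norm_inf_to_1`, and `theta` are hypothetical, and both brute-force loops are exponential in the vector length, so this is only meant for a handful of logits.

```python
import itertools
import math


def softmax(u, tau=1.0):
    """Numerically stable softmax of u / tau."""
    scaled = [x / tau for x in u]
    shift = max(scaled)
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(u, tau=1.0):
    """Jacobian of softmax(u / tau) with respect to u: (diag(p) - p p^T) / tau."""
    p = softmax(u, tau)
    n = len(p)
    return [
        [((p[i] if i == j else 0.0) - p[i] * p[j]) / tau for j in range(n)]
        for i in range(n)
    ]


def norm_inf_to_1(a):
    """Brute-force l_inf -> l_1 operator norm.

    ||A x||_1 is convex in x, so its maximum over the unit l_inf ball
    is attained at a vertex, i.e. at some sign vector in {-1, +1}^n.
    """
    n = len(a)
    best = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=n):
        value = sum(abs(sum(row[j] * signs[j] for j in range(n))) for row in a)
        best = max(best, value)
    return best


def theta(p):
    """ASSUMED form of the balanced-mass factor: 4 * max_S p(S) * (1 - p(S)).

    It equals 1 when some subset of tokens carries exactly half the mass
    and 0 when all mass sits on a single token.
    """
    best = 0.0
    for mask in itertools.product((0, 1), repeat=len(p)):
        mass = sum(pi for pi, keep in zip(p, mask) if keep)
        best = max(best, 4.0 * mass * (1.0 - mass))
    return best
```

For example, with `u = [0.3, -1.2, 2.0, 0.5]` and `tau = 0.7`, comparing `norm_inf_to_1(softmax_jacobian(u, tau))` against `theta(softmax(u, tau)) / tau` should show agreement up to floating-point error under the assumptions above.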
