Rank-Aware Spectral Bounds on Attention Logits for Stable Low-Precision Training

Abstract

Attention scores in transformers are bilinear forms $S_{ij} = x_i^\top M x_j / \sqrt{d_h}$ whose maximum magnitude governs overflow risk in low-precision training. We derive a rank-aware concentration inequality: when the interaction matrix $M = W_Q W_K^\top$ has rank $r \ll d$, tail probabilities for $\max_{i,j} |S_{ij}|$ decay as $\exp(-d^2\alpha^2/(\gamma r))$ rather than $\exp(-d\alpha^2)$, where $\gamma > 1$ is a typicality parameter. For transformer attention where $r = d_h$, this yields 8--28× tighter concentration than rank-agnostic bounds in modern architectures. We apply this result to FP8 training, deriving geometry-aware scale factors that provide principled overflow guarantees without observing activations. The method computes per-layer scales from the spectral norm $\|W_Q W_K^\top\|_2$ via implicit power iteration, includes a grouped query attention formulation that avoids key expansion, and remains compatible with fused attention kernels. Across GPT-2 XL to Llama-2-70B, geometry-aware scaling eliminates overflows in transient scenarios where delayed scaling fails, while achieving comparable downstream MMLU accuracy.
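
The abstract mentions computing the spectral norm of the query-key interaction matrix by implicit power iteration. As an illustration only, the sketch below shows one generic way this could be done in NumPy: it estimates the largest singular value of the product of W_q and the transpose of W_k using thin matrix-vector products, without ever forming the d-by-d matrix. The function name `spectral_norm_qk`, the (d, d_h) weight layout, and the iteration count are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def spectral_norm_qk(W_q, W_k, num_iters=100, seed=0):
    """Estimate ||W_q @ W_k.T||_2 by power iteration on M^T M.

    Assumed (hypothetical) layout: W_q and W_k both have shape (d, d_h),
    so M = W_q @ W_k.T is d x d with rank at most d_h. M is never
    materialised; it is only applied through two thin products.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W_k.shape[0])
    v /= np.linalg.norm(v)

    for _ in range(num_iters):
        # u = M v, computed as W_q (W_k^T v)
        u = W_q @ (W_k.T @ v)
        u_norm = np.linalg.norm(u)
        if u_norm == 0.0:
            return 0.0  # v lies in the null space; M may be zero
        u /= u_norm

        # v = M^T u, computed as W_k (W_q^T u)
        v = W_k @ (W_q.T @ u)
        v_norm = np.linalg.norm(v)
        if v_norm == 0.0:
            return 0.0
        v /= v_norm

    # With v close to the top right singular vector, ||M v|| approximates
    # the largest singular value (and never exceeds it).
    return float(np.linalg.norm(W_q @ (W_k.T @ v)))
```

Each iteration costs O(d d_h) rather than the O(d^2) needed to apply an explicit d-by-d matrix. By the definition of the operator norm, the resulting value bounds the unscaled bilinear form: the magnitude of x_i^T M x_j is at most the product of the two vector norms and the spectral norm of M. How the paper turns this quantity into FP8 scale factors is not specified in the abstract and is not reproduced here.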
