P-Cast Precision in FP8 Attention: Sink-Induced Collapse and the Optimality of S=2^8

Abstract

FP8 (E4M3) acceleration for attention computation offers significant throughput gains, but the 3-bit mantissa introduces precision challenges when the softmax probability matrix P is cast to FP8 before the P · V matrix multiplication. We analyze two implementation choices that affect output precision under the Attention Sink phenomenon: (1) the KV block iteration order, and (2) the static scaling factor S applied to P before casting. We show that forward KV iteration causes P-collapse: to leading order, a fraction Φ(Δ + δ_k - 6.93 - S) of non-sink P values underflows to zero, where the small shift δ_k ≈ 1 (for k_sink = 4) is the expected within-sink-block score maximum. Reverse iteration removes this collapse, and combining reverse iteration with S = 256 gives a zero-underflow guarantee. We further give a constructive characterization of S = 256 = 2^8 as the static scale that simultaneously satisfies (i) bit-exact IEEE 754 scaling, (ii) the lower envelope of a sawtooth function d_p(S) over the E4M3 number line (d_p = 2^-4, the minimum worst-case quantization step), and (iii) the maximum normal-range coverage among bit-exact (2^k) scales (a non-bit-exact scale such as 448 attains slightly higher coverage; Sec. 5). Both optimizations are already deployed in FlashAttention-3/4 on engineering grounds; our contribution is a quantitative account of why these choices are good and a closed-form threshold Δ_c = 6.93 + S - δ_k for predicting kernel-level precision loss. Kernel-faithful experiments (Q, K, V in FP32 to isolate the P-cast effect) show a 3-10× MSE improvement at moderate sink strengths, and paired tests confirm that both fixes saturate to the same precision floor when combined, which motivated updating the hpc-ops kernel from S=1 to S=256.
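
The underflow mechanism described in the abstract can be illustrated with a small software model of the P cast. This is a minimal sketch, not the paper's kernel or the FlashAttention or hpc-ops code: it assumes the E4M3 variant with exponent bias 7, subnormals down to 2^-9, a largest finite value of 448, saturation on overflow, and round-to-nearest-even. The names `quantize_e4m3`, `cast_p`, and `underflow_gap` are hypothetical. Under these assumptions a scaled value at or below 2^-10 rounds to zero, so with a scale of 1 a probability is lost once its score sits more than 10 ln 2 ≈ 6.93 nats below the running maximum, which matches the constant in the abstract.

```python
import math

E4M3_MAX = 448.0          # largest finite value (assumed E4M3 "FN" variant)
E4M3_MIN_NORMAL_EXP = -6  # lowest normal binade is [2**-6, 2**-5)
MANTISSA_BITS = 3


def quantize_e4m3(x: float) -> float:
    """Round a non-negative float to the nearest value on the assumed E4M3 grid."""
    if x <= 0.0:
        return 0.0
    x = min(x, E4M3_MAX)                   # saturate instead of overflowing
    _, e = math.frexp(x)                   # x = m * 2**e with m in [0.5, 1)
    # Subnormals reuse the spacing of the lowest normal binade.
    exp = max(e - 1, E4M3_MIN_NORMAL_EXP)
    step = 2.0 ** (exp - MANTISSA_BITS)    # gap between neighbouring grid points
    # Python's round() resolves ties to the even integer.
    return min(round(x / step) * step, E4M3_MAX)


def cast_p(p: float, scale: float) -> float:
    """Scale p, cast it to the E4M3 grid, then undo the scale."""
    return quantize_e4m3(p * scale) / scale


def underflow_gap(scale: float) -> float:
    """Score gap (in nats below the running max) beyond which exp(-gap) casts to 0."""
    return 10.0 * math.log(2.0) + math.log(scale)


if __name__ == "__main__":
    for scale in (1.0, 256.0):
        print(f"scale={scale:g}: underflow beyond a gap of {underflow_gap(scale):.3f} nats")
        for gap in (4.0, 8.0, 12.0, 14.0):
            p = math.exp(-gap)
            print(f"  gap={gap:4.1f}  p={p:.3e}  cast={cast_p(p, scale):.3e}")
```

In this model a power-of-two scale only shifts the exponent, so multiplying and dividing by it introduces no extra rounding, and S = 256 keeps the largest possible probability (P = 1) below the assumed maximum of 448.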
