Compressible Softmax-Attended Language under Incompressible Attention

Abstract

Softmax attention defines an interaction through d_h head dimensions, but not all dimensions carry equal weight once real text passes through. We decompose the attention logit field into a learned component and a generated component and measure their spectra separately. For all 5,888 KV heads in five transformer language models (124M to 7B parameters, four architecture families), the logit energy field E reaches 90% of its variance in 2 to 11 singular components. The learned interaction matrix W_Q^T W_K needs 38 to 75 components for the same threshold, out of d_h ∈ {64, 128}. The spectral gap is 5 to 25× in effective rank. The compressibility of softmax-attended language is a property of the data, not the frame that analyzes it.
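The kind of measurement the abstract describes can be illustrated with a minimal sketch: count how many singular components are needed to reach 90% of the squared-singular-value energy, once for a learned interaction matrix W_Q^T W_K and once for the logits it produces on data. Everything here is an assumption made for illustration: the weights are random Gaussians rather than trained ones, the inputs are synthetic rank-4 vectors rather than real text, the raw logit matrix stands in for the logit energy field E (whose exact definition the abstract does not give), and the helper name `components_for_energy` is hypothetical.

```python
import numpy as np


def components_for_energy(matrix, threshold=0.90):
    """Smallest k such that the top-k singular values hold at least
    `threshold` of the total squared-singular-value energy."""
    s = np.linalg.svd(matrix, compute_uv=False)  # singular values, descending
    energy = np.cumsum(s**2) / np.sum(s**2)      # cumulative energy fraction
    # First index where the cumulative fraction reaches the threshold.
    return int(np.searchsorted(energy, threshold) + 1)


rng = np.random.default_rng(0)
d_model, d_h, n_tokens, data_rank = 256, 64, 128, 4  # illustrative sizes

# Stand-in for the learned component: random projections, not trained weights.
W_Q = rng.standard_normal((d_h, d_model))
W_K = rng.standard_normal((d_h, d_model))
interaction = W_Q.T @ W_K  # (d_model, d_model), rank at most d_h

# Stand-in for the generated component: inputs confined to a rank-4 subspace.
Z = rng.standard_normal((n_tokens, data_rank))
B = rng.standard_normal((data_rank, d_model))
X = Z @ B
logits = (X @ interaction @ X.T) / np.sqrt(d_h)  # (n_tokens, n_tokens)

k_learned = components_for_energy(interaction)
k_logits = components_for_energy(logits)
print(f"components for 90% energy: learned={k_learned}, logits={k_logits}")
```

Because the synthetic inputs span only four directions, the logits cannot need more than four components, while the random interaction matrix spreads its energy across many of its d_h directions. This toy setup only shows how low-rank data can make the logits compressible when the interaction matrix is not; it does not reproduce the reported figures.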
