Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection

Abstract

Standard Transformer attention uses the same dimensionality for queries, keys, and values, yet these components serve different roles: queries and keys produce scalar attention weights (selection), while values carry rich representations (value transfer). We show that selection requires only $O(\log N)$ dimensions to distinguish among $N$ relevant token categories (e.g., syntactic roles, semantic clusters, positional patterns), far fewer than value transfer needs. We introduce factored keys, which exploit this asymmetry to physically shrink the KV cache of any pretrained model without retraining from scratch, unlike Grouped-Query Attention (GQA) and Multi-Head Latent Attention (MLA), which must be designed into the architecture before pretraining. We factorize each key projection as $W_K \approx A_{d \times r} B_{r \times d}$ via truncated singular value decomposition (SVD), where $r$ is the chosen compression dimension. We then set $W_K' = A$ as the new key projection, which produces compact $r$-dimensional keys for the cache, and absorb $B$ into the query projection ($W_Q' = W_Q B^\top$) at zero cost, since queries are never cached. At the 7B scale, training from scratch with $r = d/4$ (where $d$ is the model dimension) matches full-attention perplexity (9.24 vs 9.25 PPL after 20B tokens, mean over two seeds) while using 12% fewer parameters and training 8% faster. For existing models, SVD followed by QK fine-tuning (3 epochs, less than 1% of pretraining data) achieves 75% key cache savings at roughly 2% quality cost on both GPT-2 and Mistral-7B. The approach composes with GQA and quantization for up to 16× combined key cache compression. For a 7B model serving a 128K context, factored keys save 25 GB of KV cache per user, enabling roughly 60% more concurrent users on identical hardware.
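The factorization step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `factor_keys`, the row-vector convention (`q = x @ W_q`, `k = x @ W_k`), the single-head setting, and the omission of softmax scaling and positional encodings are all assumptions made for illustration. Under that convention the attention logits are `(X W_q)(X W_k)^T`, so substituting `W_k ≈ A B` gives `(X W_q B^T)(X A)^T`, which is why the transpose of `B` is folded into the query side.

```python
import numpy as np

def factor_keys(W_q, W_k, r):
    """Hypothetical helper: split W_k into a thin key projection and
    fold the remaining factor into the query projection.

    Assumes row-vector inputs, i.e. q = x @ W_q and k = x @ W_k.
    Returns (W_q_thin, W_k_thin), each with r columns.
    """
    # Truncated SVD: W_k ~= (U_r * S_r) @ Vt_r
    U, S, Vt = np.linalg.svd(W_k, full_matrices=False)
    A = U[:, :r] * S[:r]   # (d, r): new key projection, singular values folded in
    B = Vt[:r, :]          # (r, d): factor to absorb into the queries
    W_k_thin = A
    W_q_thin = W_q @ B.T   # (d, r): queries now live in the same r-dim space
    return W_q_thin, W_k_thin

# Toy check with an exactly rank-r key matrix, so truncation loses nothing.
rng = np.random.default_rng(0)
d, r, n_tokens = 16, 4, 5
W_q = rng.standard_normal((d, d))
W_k = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))
X = rng.standard_normal((n_tokens, d))

W_q_thin, W_k_thin = factor_keys(W_q, W_k, r)

full_scores = (X @ W_q) @ (X @ W_k).T            # keys cached at width d
thin_scores = (X @ W_q_thin) @ (X @ W_k_thin).T  # keys cached at width r
```

In this toy case the two score matrices agree to numerical precision because `W_k` was built with rank `r`. For a real pretrained key projection the truncation is lossy, which is consistent with the abstract's use of QK fine-tuning after the SVD step.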
