Statistical Inference and Quality Measures of KV Cache Quantisations Inspired by TurboQuant

Abstract

We analyse three KV cache quantization schemes under a fair bit budget: KV (scalar MSE baseline), KQV (WHT + MSE on K; WHT + MSE + QJL on V), and QKQV (WHT + MSE + QJL on both K and V). Starting from the Beta distribution on the hypersphere, we trace how QJL on K inflates inner product variance by a factor of π/2, which softmax amplifies nonlinearly via Jensen's inequality, and we present statistical inference and information metrics to highlight practical differences. Three empirical findings emerge. (1) At n=4 (the practically dominant budget), KQV wins on every measure (KL divergence, geometric K error, and 6D distance) across all distributions and ranks tested. (2) The K-V asymmetry is unconditional: QKQV is consistently worse than KQV in KL divergence at every budget and distribution. (3) A budget-dependent crossover exists: QKQV achieves better geometric K reconstruction at n ∈ {2,3,5}, KQV at n ∈ {4,6}, invariant to rank and tail weight; this remains an open rate-distortion problem. KL(p_ref ‖ p_quant), K-only by construction, bridges K direction error to routing corruption and output collapse. We present a sufficient condition under which the Jensen mechanism amplifies superlinearly through the softmax. At n ∈ {2,3,5}, QKQV wins geometrically because this condition does not bind. At n=4, elevated K error and KL divergence for QKQV strongly suggest that the Jensen mechanism is the operative cause of the crossover, providing a new perspective and explanation.
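The link between inflated inner product variance and softmax corruption can be illustrated with a toy simulation. The sketch below is not the paper's experiment: it assumes i.i.d. Gaussian noise on the attention logits (a stand-in for quantization error on K), and every name in it (softmax, kl_divergence, mean_kl_under_logit_noise, jensen_inflation, the example logits and sigma_base) is hypothetical. It compares the mean KL(p_ref ‖ p_quant) at a baseline noise level against the same noise with its variance scaled by π/2, and evaluates the Gaussian identity E[exp(ε)] = exp(σ²/2) ≥ 1 that underlies the Jensen argument.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mean_kl_under_logit_noise(logits, sigma, trials=2000, seed=0):
    """Average KL(p_ref || p_noisy) when N(0, sigma^2) noise is added to logits.

    Assumption: Gaussian logit noise stands in for quantization error on K.
    """
    rng = random.Random(seed)
    p_ref = softmax(logits)
    total = 0.0
    for _ in range(trials):
        noisy = [x + rng.gauss(0.0, sigma) for x in logits]
        total += kl_divergence(p_ref, softmax(noisy))
    return total / trials


def jensen_inflation(sigma):
    """E[exp(eps)] for eps ~ N(0, sigma^2); exceeds exp(E[eps]) = 1 when sigma > 0."""
    return math.exp(sigma ** 2 / 2.0)


# Hypothetical attention logits for one query over five keys.
logits = [2.0, 1.0, 0.5, 0.0, -1.0]

sigma_base = 0.3                                        # assumed baseline noise level
sigma_inflated = sigma_base * math.sqrt(math.pi / 2.0)  # variance scaled by pi/2

kl_base = mean_kl_under_logit_noise(logits, sigma_base)
kl_inflated = mean_kl_under_logit_noise(logits, sigma_inflated)

print("mean KL, baseline variance:", kl_base)
print("mean KL, variance x pi/2  :", kl_inflated)
print("Jensen inflation factors  :", jensen_inflation(sigma_base),
      jensen_inflation(sigma_inflated))
```

Because both runs share a seed, they see the same noise draws up to a scale factor, so the comparison isolates the effect of the variance inflation. How closely this toy model tracks the behaviour of actual QJL error on K depends on the Gaussian assumption and should be checked against the paper's own measurements.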
