FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression
Abstract
Long-context inference is increasingly a memory-traffic problem. The culprit is the key-value (KV) cache: it grows with context length, batch size, layers, and heads, and it is read at every decoding step. Rotation-based scalar codecs meet this systems constraint by storing a norm, applying a shared random rotation, and quantizing one coordinate at a time. They are universal and random-access, but they discard the geometry created by the normalization step. After a Haar rotation, a block of k consecutive coordinates is not a product source; it is a spherical-Beta source on the unit ball. We introduce FibQuant, a universal fixed-rate vector quantizer that keeps the same normalize-rotate-store interface while replacing scalar tables by a shared radial-angular codebook matched to this canonical source. The codebook combines Beta-quantile radii, Fibonacci/Roberts-Kronecker quasi-uniform directions, and multi-restart Lloyd-Max refinement. We prove that the resulting vector code strictly improves on its scalar product specialization at matched rate, with a high-rate gain that separates into a cell-shaping factor and a density-matching factor. The same construction gives a dense rate axis, including fractional-bit and sub-one-bit operating points, without calibration or variable-length addresses. On GPT-2 small KV caches, FibQuant traces a memory-fidelity frontier from 5× compression at 0.99 attention cosine similarity to 34× at 0.95. End-to-end on TinyLlama-1.1B, it is within 0.10 perplexity of fp16 at 4× compression and has 3.6× lower perplexity than scalar TurboQuant at b = 2 (8× compression), where scalar random-access quantization begins to fail.
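Two claims in the abstract can be checked numerically with a few lines of NumPy. The sketch below is illustrative only and is not the paper's code: the function names (`haar_rotation`, `block_squared_radius`, `bits_per_coordinate`) and the sizes `d = 16`, `k = 2` are assumptions chosen for the example. It relies on the standard fact that, for a vector uniform on the unit sphere in d dimensions, the squared norm of any k coordinates follows a Beta(k/2, (d-k)/2) distribution with mean k/d. It also shows why a shared codebook of N codewords over k-coordinate blocks yields fractional and sub-one-bit rates: the index costs log2(N)/k bits per coordinate (the per-vector norm overhead is ignored here).

```python
import math
import numpy as np

def haar_rotation(d, rng):
    """Draw a d x d orthogonal matrix from the Haar measure via sign-corrected QR."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    # Flip column signs so that the result is Haar-distributed, not just orthogonal.
    return q * np.sign(np.diag(r))

def block_squared_radius(x, rotation, k):
    """Normalize x, rotate it, and return the squared norm of its first k coordinates."""
    u = x / np.linalg.norm(x)   # normalize: the norm would be stored separately
    y = rotation @ u            # rotate: y is uniform on the unit sphere
    return float(np.sum(y[:k] ** 2))

def bits_per_coordinate(num_codewords, k):
    """Fixed-rate index cost, per coordinate, of one k-coordinate block."""
    return math.log2(num_codewords) / k

# Hypothetical sizes for illustration only.
d, k = 16, 2
rng = np.random.default_rng(0)

# An arbitrary fixed input: the rotation, not the data, supplies the randomness.
x = np.zeros(d)
x[0] = 3.0

samples = [block_squared_radius(x, haar_rotation(d, rng), k) for _ in range(2000)]
mean_sq_radius = float(np.mean(samples))  # should be close to k / d

print(f"empirical mean squared radius: {mean_sq_radius:.4f} (theory: {k / d:.4f})")
print(f"256 codewords on 4-coordinate blocks: {bits_per_coordinate(256, 4)} bits/coord")
print(f"3 codewords on 2-coordinate blocks: {bits_per_coordinate(3, 2):.3f} bits/coord")
```

Because the block radius is random and bounded by 1, the block lives inside the unit ball rather than on a product grid, which is the geometry a radial-angular codebook is meant to match. The rate helper shows that choosing N and k freely gives operating points such as roughly 0.79 bits per coordinate, with no variable-length addressing.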