Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression
Abstract
We propose Hurwitz Quaternion Multiplicative Quantization (HQMQ), a calibration-free method for compressing the KV cache of large language models. HQMQ treats each 4-element chunk of K or V as a quaternion and quantizes its unit direction to the product qp · qs, where qp ranges over the 24-element Hurwitz group 2T (the 24 vertices of the 24-cell on S^3, nearest-neighbor angle 60°) and qs ranges over a per-(layer, head) secondary codebook of S random unit quaternions. The multiplicative composition yields 24S effective codewords at a storage cost of S codebook entries. Random initialization suffices because left-multiplication is an isometry of S^3; across codebook seeds, end-task perplexity (ppl) varies by <1.5%. A per-batch median-multiplier outlier extraction step (C = 3, no calibration) handles modern outlier-heavy architectures. We evaluate on five modern open models: Mistral-7B (dense MHA); Llama-3-8B, Qwen2.5-7B, and Qwen3-8B (dense GQA); and gpt-oss-20b (sparse MoE). On Mistral-7B and Qwen3-8B, HQMQ matches fp16 within 0.02–0.03 ppl points at 5 bits. On Qwen2.5-7B and Qwen3-8B, where naive int4 collapses to 104+ ppl, HQMQ + Med3× recovers fp16 quality within 0.02–0.10 ppl points at 5 bits. HQMQ Pareto-dominates naive integer quantization by 3–1900× at matched bits across all five models, and downstream zero-shot accuracy matches fp16 at 3.79 bits on Mistral-7B. Against the strongest calibrated KV-quantization baseline, HQMQ at 3.79 bits matches KIVI-4 (4.5 bits) within 1 pt on CoQA, 0.6 pts on TruthfulQA, and 2.3 pts on GSM8K, at 16% fewer bits and without a calibration pass. At the storage level, HQMQ delivers up to 5.05× KV compression, shrinking a Llama-3-70B 128k-context cache from 43 GB to 8.5 GB.
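The multiplicative codebook described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (hurwitz_units, qmul, make_codebook, quantize, dequantize) are hypothetical, the codebook is searched by brute-force nearest direction, and each chunk's norm is kept in full precision, since the abstract does not say how magnitudes are stored. The outlier extraction step is omitted.

```python
import itertools
import numpy as np

def hurwitz_units():
    """The 24 unit Hurwitz quaternions (group 2T) as a (24, 4) array, (w, x, y, z) order."""
    # 8 axis units: +-1, +-i, +-j, +-k
    axis = [s * np.eye(4)[i] for i in range(4) for s in (1.0, -1.0)]
    # 16 half-integer units: (+-1 +-i +-j +-k) / 2
    half = [0.5 * np.array(signs) for signs in itertools.product((1.0, -1.0), repeat=4)]
    return np.stack(axis + half)

def qmul(a, b):
    """Hamilton product a * b for (..., 4) arrays, with NumPy broadcasting."""
    aw, ax, ay, az = np.moveaxis(a, -1, 0)
    bw, bx, by, bz = np.moveaxis(b, -1, 0)
    return np.stack([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ], axis=-1)

def make_codebook(S, seed=0):
    """Draw S random unit quaternions qs and expand to the 24*S products qp * qs."""
    rng = np.random.default_rng(seed)
    qs = rng.standard_normal((S, 4))
    qs /= np.linalg.norm(qs, axis=1, keepdims=True)  # normalized Gaussians are uniform on S^3
    qp = hurwitz_units()
    products = qmul(qp[:, None, :], qs[None, :, :])  # shape (24, S, 4)
    return qs, products.reshape(24 * S, 4)

def quantize(x, codebook, eps=1e-12):
    """Split x into 4-element chunks; return nearest-codeword index and norm per chunk."""
    chunks = np.asarray(x, dtype=np.float64).reshape(-1, 4)
    norms = np.linalg.norm(chunks, axis=1)
    dirs = chunks / np.maximum(norms, eps)[:, None]
    # For unit vectors, the nearest codeword is the one with the largest dot product.
    idx = np.argmax(dirs @ codebook.T, axis=1)
    return idx, norms  # assumption: norms are stored unquantized in this sketch

def dequantize(idx, norms, codebook):
    """Rebuild each chunk as norm * codeword direction."""
    return codebook[idx] * norms[:, None]
```

As a usage example, `qs, cb = make_codebook(S=8)` stores 8 secondary quaternions but gives `cb` 192 rows, and `dequantize(*quantize(x, cb), cb)` returns the reconstruction of a vector `x` whose length is a multiple of 4. A practical implementation would presumably index qp and qs separately instead of materializing all 24S products.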