When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon

Abstract

KV-cache quantization is usually framed as a quality-latency trade-off. We show that this trade-off is inverted on Apple Silicon's unified memory: a single fused Metal kernel (sign-randomized FFT + per-channel λ + per-group abs-max + int4 nibble pack), exposed as a HuggingFace Cache subclass, runs faster than fp16 across 256 to 4096-token prefixes on Gemma-3 1B (-3 to -8% ms/tok) and at short context on Qwen2.5-1.5B (-0.7 to -2.6% through 1K), with 3× persistent memory compression and quality preserved (ΔPPL = 0.000 Qwen short-prompt; +3.6 hook Gemma). The kernel's ~25 ns/vec overhead is below the bandwidth savings from 3× compression. The fused kernel also closes Qwen's 4-bit per-token catastrophe (ΔPPL = +7975 → +638.6, a 12.5× reduction) at 182 GFLOPS / D=128. Supporting findings: and are statistically indistinguishable for KV quality (we pick for mixed-radix and matrix-multiply alignment); a learned-rotation ablation surfaces a regularization role for the fixed random SRFT base (learning R+λ without SRFT lowers calibration MSE by 84.9% versus 50.3% but yields worse PPL); Householder rotations at k=d/2 reflectors are effectively lossless at d=256.
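The four stages named in the abstract (sign-randomized transform, per-channel λ, per-group abs-max, int4 nibble pack) can be illustrated with a minimal NumPy sketch. This is not the paper's Metal kernel. The function names, the group size of 32, the fp16 scale storage, the symmetric [-7, 7] code range, and the use of an orthonormal discrete Hartley transform (computed from the FFT) as a real-valued stand-in for the sign-randomized FFT are all assumptions made for illustration; the abstract specifies none of them.

```python
import numpy as np

def srft_rotate(x, signs):
    """Random sign flip followed by an orthonormal discrete Hartley transform.
    The Hartley transform is an assumed real-valued stand-in for the SRFT."""
    f = np.fft.fft(x * signs, axis=-1)
    return (f.real - f.imag) / np.sqrt(x.shape[-1])

def srft_unrotate(y, signs):
    """Inverse of srft_rotate: the orthonormal Hartley transform is self-inverse."""
    f = np.fft.fft(y, axis=-1)
    return ((f.real - f.imag) / np.sqrt(y.shape[-1])) * signs

def quantize_int4(y, lam, group=32):
    """Per-channel scale, per-group abs-max, symmetric int4, two codes per byte."""
    z = (y * lam).reshape(-1, group)                 # requires d % group == 0
    scale = (np.abs(z).max(axis=1, keepdims=True) / 7.0).astype(np.float16)
    scale[scale == 0] = 1.0                          # avoid division by zero
    q = np.clip(np.rint(z / scale.astype(np.float32)), -7, 7).astype(np.int8)
    u = (q + 8).astype(np.uint8).reshape(-1)         # shift codes into 1..15
    packed = u[0::2] | (u[1::2] << np.uint8(4))      # low nibble, high nibble
    return packed, scale

def dequantize_int4(packed, scale, lam, d, group=32):
    """Unpack nibbles, undo the per-group and per-channel scaling."""
    u = np.empty(packed.size * 2, dtype=np.uint8)
    u[0::2] = packed & 0x0F
    u[1::2] = packed >> 4
    q = u.astype(np.float32) - 8.0
    z = q.reshape(-1, group) * scale.astype(np.float32)
    return z.reshape(-1, d) / lam

# Toy round trip on random "key" vectors (hypothetical data, d = 128).
rng = np.random.default_rng(0)
d = 128
signs = rng.choice([-1.0, 1.0], size=d)   # fixed random sign pattern
lam = np.ones(d)                          # placeholder per-channel scale
k = rng.standard_normal((4, d)).astype(np.float32)

packed, scale = quantize_int4(srft_rotate(k, signs), lam)
k_hat = srft_unrotate(dequantize_int4(packed, scale, lam, d), signs)
rel_err = np.linalg.norm(k - k_hat) / np.linalg.norm(k)
print(f"packed bytes: {packed.nbytes}, scale bytes: {scale.nbytes}, rel. error: {rel_err:.3f}")
```

In this sketch λ is set to all ones; in the paper it is a per-channel parameter whose values the abstract does not give. A fused kernel would perform all four stages in one pass over each vector rather than materializing the intermediate arrays as done here.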
