Preserve-Then-Quantize: Balancing Rank Budgets for Quantization Error Reconstruction in LLMs
Abstract
Quantization Error Reconstruction (QER) reduces accuracy loss in Post-Training Quantization (PTQ) by approximating weights as W ≈ Q + LR, using a rank-r correction to reconstruct quantization error. Prior methods devote the full rank budget to error reconstruction, which is suboptimal when W has intrinsic low-rank structure and quantization corrupts dominant directions. We propose Structured Residual Reconstruction (SRR), a rank-allocation framework that preserves the top-k singular subspace of the activation-scaled weight before quantization, quantizes only the residual, and uses the remaining rank r - k for error reconstruction. We derive a theory-guided criterion for selecting k by balancing quantization-exposed energy and unrecoverable error under rank constraints. We further show that the resulting Q + LR parameterization naturally supports Quantized Parameter-Efficient Fine-Tuning (QPEFT) and stabilizes fine-tuning via gradient scaling along preserved directions. Experiments demonstrate consistent perplexity reductions across diverse models and quantization settings in PTQ, along with a 5.9 percentage-point average gain on GLUE under 2-bit QPEFT. The project page is available at https://ai-isl.github.io/srr.
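
The preserve-then-quantize pipeline described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a weight W of shape (d_out, d_in), a positive per-input-channel activation scale applied as W·diag(s), and a plain round-to-nearest quantizer as a stand-in for whatever PTQ quantizer is actually used. The names `srr_sketch`, `quantize_uniform`, and `truncated_svd` are hypothetical, and k is passed in by hand because the abstract does not spell out the selection criterion.

```python
import numpy as np

def quantize_uniform(M, bits=4):
    # Stand-in quantizer (assumption): symmetric per-tensor round-to-nearest, bits >= 2.
    levels = 2 ** (bits - 1) - 1
    max_abs = np.abs(M).max()
    if max_abs == 0:
        return M.copy()
    scale = max_abs / levels
    return np.round(M / scale).clip(-levels, levels) * scale

def truncated_svd(M, rank):
    # Best rank-`rank` factors of M in Frobenius norm: M ~ A @ B.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank]

def srr_sketch(W, act_scale, r, k, bits=4):
    # Illustrative split of a rank budget r into k preserved directions
    # and r - k error-reconstruction directions (0 <= k <= r).
    s = act_scale  # shape (d_in,), assumed strictly positive

    # 1) Preserve the top-k singular subspace of the activation-scaled weight W @ diag(s).
    A1, B1 = truncated_svd(W * s, k)
    L1, R1 = A1, B1 / s  # undo the scaling so L1 @ R1 lives in weight space

    # 2) Quantize only the residual.
    residual = W - L1 @ R1
    Q = quantize_uniform(residual, bits)

    # 3) Spend the remaining rank r - k on the scaled quantization error.
    E = residual - Q
    A2, B2 = truncated_svd(E * s, r - k)
    L2, R2 = A2, B2 / s

    # Stack both parts into a single rank-r correction: W ~ Q + L @ R.
    L = np.concatenate([L1, L2], axis=1)
    R = np.concatenate([R1, R2], axis=0)
    return Q, L, R
```

Setting k = 0 in this sketch reduces it to the conventional QER setup in which the whole rank budget goes to error reconstruction, so sweeping k from 0 to r on a given layer is one simple way to compare allocations under the same budget.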