Pruning Deep Neural Networks via the Marchenko–Pastur Distribution
Abstract
We study a Marchenko–Pastur (MP) random-matrix approach to pruning deep neural networks with very small post-pruning fine-tuning budgets. The main practical contribution is accuracy retention under short calibration and fine-tuning schedules, rather than a long post-pruning reoptimization pipeline. The theory gives deterministic data-path certificates: if the removed component R has small propagated logit effect Ls ‖R ψ1(s)‖∞, pruning decreases an elastic-net objective and preserves samples whose dense margin exceeds twice the perturbation. The zero-budget case gives perfect pruning; a prune–restore extension models weight restoration inside a fixed sparse-execution pattern; and an additive L2-regularized model shows admissible random-like components vanish at the training limit, with persistent spikes stabilizing as the MP bulk collapses. Under iid-Gaussian sufficient conditions, the fitted MP edge σ+ gives a high-probability layerwise budget signal. On ImageNet-1k, after only three distillation epochs, ViT-B/16 2:4+ToMe reaches 83.41% top-1 (-1.70 pp from dense) at 59.81% sparse-execution MAC reduction, with 1.388× best-observed A40 native-2:4 backend speedup for the same checkpoint and ToMe graph; a separate no-ToMe A100 endpoint gives 2.705×. At structured sparsity, ViT-B/16 6:12 reaches 83.74%, ViT-L/16 8:16 dense+permutation reaches 85.33% (-0.51 pp), and ConvNeXtV2-Base 12:16 reaches 86.35% (-0.37 pp). For CNNs, ResNet50 8:16 dense+permutation reaches 75.87% (-0.26 pp), and ResNet152d CAST-conv+permutation reaches 81.33% (-1.53 pp) at 50% MAC accounting with a 1.62× A40 im2col+2:4 sparse-GEMM audit.
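The margin certificate and the MP edge mentioned in the abstract can be illustrated with a small sketch. It is not the paper's procedure. The function names (`mp_singular_edge`, `dense_margin`, `prediction_certified`) are hypothetical. The edge formula is the textbook asymptotic value for a matrix with iid zero-mean entries of standard deviation `sigma`, which may differ from how the paper fits σ+. The scalar `eps` stands in for an upper bound on the sup-norm logit perturbation caused by pruning.

```python
import math

def mp_singular_edge(n_rows, n_cols, sigma):
    """Asymptotic largest singular value of an n_rows x n_cols matrix
    with iid zero-mean entries of standard deviation sigma.
    Assumption: textbook MP edge, not the paper's fitted sigma+."""
    return sigma * (math.sqrt(n_rows) + math.sqrt(n_cols))

def dense_margin(logits, label):
    """Gap between the logit of `label` and the best competing logit."""
    runner_up = max(z for j, z in enumerate(logits) if j != label)
    return logits[label] - runner_up

def prediction_certified(logits, label, eps):
    """True if the margin exceeds twice the sup-norm perturbation bound eps.
    In the worst case the label logit drops by eps and a competitor rises
    by eps, so a margin above 2*eps keeps the argmax unchanged."""
    return dense_margin(logits, label) > 2.0 * eps

def argmax(values):
    return max(range(len(values)), key=lambda j: values[j])

# Illustrative numbers only; they are not taken from the paper.
logits = [2.0, 0.5, -1.0]
label = argmax(logits)
eps = 0.7

# Worst-case perturbation allowed by the bound: push the top logit down
# and the runner-up logit up, each by eps.
worst_case = [logits[0] - eps, logits[1] + eps, logits[2]]

certified = prediction_certified(logits, label, eps)
still_correct = argmax(worst_case) == label
edge = mp_singular_edge(100, 400, 1.0)
```

With a margin of 1.5 and `eps = 0.7`, the check passes and the worst-case perturbed logits still select the same class. Raising `eps` to 0.8 would make the certificate fail, although a failed certificate only means the prediction is no longer guaranteed, not that it must change.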