Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals

Abstract

Training-free token reduction methods for Vision Transformers (ToMe, ToFu, PiToMe, and MCTF) employ different scoring mechanisms, yet they share a closely matched, cliff-like collapse at high compression. This paper explains why. We develop a diagnostic framework with two tools, ranking consistency $\rho_s$ and off-diagonal correlation $\rho_{\text{off}}$, that decomposes the collapse into (1) a signal-agnostic error amplifier inherent to layer-wise reduction, which predicts convex Pareto curves and $r_{\text{crit}}$ scaling as $1/L$; and (2) a shared reliance on pairwise similarity signals whose ranking consistency degrades from $\rho_s = 0.88$ to $0.27$ in deep layers. Pairwise rankings are inherently unstable ($O(N_p^2)$ joint perturbations), while unary signals enjoy greater stability ($O(N_p)$ perturbations, CLT). From three design principles derived from this diagnosis, we construct CATIS as a constructive validation: unary signals raise the trigger threshold, and triage suppresses the gain. On ViT-Large at 63% FLOPs reduction, CATIS retains 96.9% of vanilla accuracy (81.0%) on ImageNet-1K, where all baselines collapse to 43-65%.
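
As a rough illustration of what a ranking-consistency measurement could look like, the sketch below compares how well a score ranking survives a small perturbation of the tokens. It is not the paper's implementation. It assumes that ranking consistency is a Spearman rank correlation, uses "maximum cosine similarity to any other token" as a stand-in pairwise signal and the L2 norm as a stand-in unary signal, and draws random Gaussian vectors in place of real ViT tokens. All function names, the noise model, and the sizes are hypothetical choices made for this example.

```python
import math
import random


def ranks(values):
    """0-based ranks of values (ties broken by position; fine for continuous scores)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order):
        out[idx] = rank
    return out


def spearman(a, b):
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    mean = (len(a) - 1) / 2.0  # both rank vectors are permutations of 0..n-1
    cov = sum((x - mean) * (y - mean) for x, y in zip(ra, rb))
    var = sum((x - mean) ** 2 for x in ra)
    return cov / var


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    return sum(x * y for x, y in zip(u, v)) / (l2_norm(u) * l2_norm(v))


def pairwise_scores(tokens):
    """Assumed pairwise signal: each token's max cosine similarity to any other token."""
    return [
        max(cosine(t, o) for j, o in enumerate(tokens) if j != i)
        for i, t in enumerate(tokens)
    ]


def unary_scores(tokens):
    """Assumed unary signal: each token's own L2 norm."""
    return [l2_norm(t) for t in tokens]


def perturb(tokens, sigma, rng):
    """Add independent Gaussian noise to every coordinate (toy noise model)."""
    return [[x + rng.gauss(0.0, sigma) for x in t] for t in tokens]


rng = random.Random(0)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(32)]  # toy tokens
noisy = perturb(tokens, 0.1, rng)

pair_consistency = spearman(pairwise_scores(tokens), pairwise_scores(noisy))
unary_consistency = spearman(unary_scores(tokens), unary_scores(noisy))
print(f"pairwise ranking consistency: {pair_consistency:.3f}")
print(f"unary ranking consistency:    {unary_consistency:.3f}")
```

Each pairwise score depends on the perturbation of two tokens at once, whereas each unary score depends on one, which is the intuition behind the stability argument above. The numbers printed by this toy setup say nothing about the values reported for real models.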
