Hidden Heroes and Gradient Bloats: Layer-Wise Redundancy Inverts Attribution in Transformers

Abstract

Gradient-based attribution is the workhorse of mechanistic interpretability, yet whether it reliably tracks causal importance at the component level remains largely untested. We causally evaluate this assumption across two algorithmic tasks and up to 10 random seeds, uncovering a systematic, layer-wise failure: gradient attribution consistently overvalues early-layer Gradient Bloats and undervalues late-layer Hidden Heroes. The rank correlation between gradient attribution and causal importance collapses from 0.72 on sequence reversal to 0.27 on sequence sorting, reaching -0.18 in individual seeds. This failure stems from first-order gradient attribution's inability to detect collective redundancy: joint Bloat ablation causes 14× greater damage than individual ablation results predict. Consequently, Bloats dominate gradient rankings despite negligible functional impact, while ablating Hidden Heroes destroys OOD accuracy (-36.4% ± 22.8%). This systematic inversion of early-layer feature extraction and late-layer computation motivates causal validation as a prerequisite for circuit-level claims.
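The collective-redundancy effect can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the paper's model or method: the toy "model" is a hypothetical soft OR over three redundant components, and the names `toy_output`, `ablation_damage`, `grad_times_input`, and `STRENGTH` are invented. Because any one component can carry the signal alone, removing a single component barely changes the output, and a first-order score computed at the unablated point sees the same small effect.

```python
from math import prod

# Hypothetical toy "model": three redundant components, each able to carry
# the signal on its own (a soft OR). This is NOT the paper's architecture.
STRENGTH = 0.9  # assumed per-component signal strength


def toy_output(acts):
    """Soft OR over component activations in [0, 1]."""
    return 1.0 - prod(1.0 - STRENGTH * a for a in acts)


def ablation_damage(acts, ablated):
    """Drop in output when the components at the given indices are zeroed."""
    patched = [0.0 if i in ablated else a for i, a in enumerate(acts)]
    return toy_output(acts) - toy_output(patched)


def grad_times_input(acts, i, eps=1e-6):
    """First-order score: central finite-difference gradient times activation."""
    up = [a + eps if j == i else a for j, a in enumerate(acts)]
    down = [a - eps if j == i else a for j, a in enumerate(acts)]
    grad = (toy_output(up) - toy_output(down)) / (2 * eps)
    return grad * acts[i]


acts = [1.0, 1.0, 1.0]
individual = [ablation_damage(acts, {i}) for i in range(len(acts))]
first_order = [grad_times_input(acts, i) for i in range(len(acts))]
joint = ablation_damage(acts, {0, 1, 2})
ratio = joint / sum(individual)

print("individual ablation damage:", individual)
print("first-order scores:        ", first_order)
print("joint ablation damage:     ", joint)
print("joint / sum of individual: ", ratio)
```

In this toy, the joint damage far exceeds the sum of the individual damages, which is the kind of gap that only a joint ablation can reveal. The size of the ratio depends entirely on the assumed `STRENGTH` and has no relation to the 14× figure reported in the abstract.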
