IG-Lens: Exact Additive Probability Attribution Across Transformer Layers via Telescoping Integrated Gradients

Duc Anh Nguyen

Abstract

We ask a simple question about decoder-only transformers: between which two layers is the probability of a predicted token actually produced? Existing layer-wise readout tools answer only approximately. The logit lens and its trained variant report a per-layer level of probability but give no additive decomposition; their estimates are biased and non-monotone across depth. Direct Logit Attribution and related residual-stream methods are additive, but only in logit space -- the softmax nonlinearity breaks additivity in probability space, precisely the quantity one usually cares about. Layer Conductance integrates gradients per layer, but attributes each to its own baseline and so does not sum to the total change in prediction. We introduce IG-Lens, a telescoping application of Integrated Gradients along a single path through the hidden states from a baseline to the final layer. Crediting each segment to the layer it terminates at yields a layer-wise attribution whose sum is exactly the change in target probability, with the softmax inside the integration path rather than linearized away. Our default estimator credits each integration step its observed change in target probability -- a prediction-aware reweighting in the spirit of IDGI -- rather than its raw gradient. Because the readout is a one-dimensional probability, this collapses each segment to a telescoping sum of endpoint values, so completeness holds exactly (to floating point) at any step count, removing Riemann discretization error while suppressing steps that show gradient sensitivity without a change in output. We give the telescoping identity and its proof, verify completeness to floating point, and describe a single-pass batched implementation computing the full token-by-layer map without any backward call. Code: https://github.com/anhnda/IGLens.
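The telescoping property described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the piecewise-linear path between consecutive hidden states, the zero baseline, the random toy unembedding `W_U`, and the function names `target_prob` and `telescoping_layer_attribution` are all assumptions made for illustration, and the final layer norm of a real model is omitted. Each integration step is credited its observed change in target probability, so each layer's attribution reduces to a difference of endpoint probabilities and the total matches the overall change regardless of step count.

```python
import numpy as np


def target_prob(h, W_U, target):
    """Softmax probability of `target` read out from hidden state `h`."""
    logits = W_U @ h
    logits = logits - logits.max()  # shift for numerical stability
    p = np.exp(logits)
    return p[target] / p.sum()


def telescoping_layer_attribution(hidden, W_U, target, steps=8):
    """Credit each layer the change in target probability along its segment.

    hidden: array of shape (L + 1, d); hidden[0] is the baseline and
            hidden[l] is the (toy) hidden state after layer l.
    Returns an array of L per-layer attributions.
    """
    alphas = np.linspace(0.0, 1.0, steps + 1)
    attributions = np.zeros(len(hidden) - 1)
    for l in range(1, len(hidden)):
        start, end = hidden[l - 1], hidden[l]
        # Assumed path: straight line from the previous state to this one.
        probs = np.array(
            [target_prob((1.0 - a) * start + a * end, W_U, target) for a in alphas]
        )
        # Each step gets its observed probability change; the sum over the
        # segment telescopes to probs[-1] - probs[0].
        attributions[l - 1] = np.diff(probs).sum()
    return attributions


# Toy setup (all values hypothetical): 6 layers, width 16, vocabulary of 50.
rng = np.random.default_rng(0)
d, vocab, L, target = 16, 50, 6, 3
W_U = rng.normal(size=(vocab, d))
updates = rng.normal(size=(L, d))
hidden = np.vstack([np.zeros(d), np.cumsum(updates, axis=0)])  # zero baseline

attr = telescoping_layer_attribution(hidden, W_U, target, steps=8)
total = target_prob(hidden[-1], W_U, target) - target_prob(hidden[0], W_U, target)
print("per-layer attribution:", attr)
print("completeness gap:", abs(attr.sum() - total))
```

Because only forward evaluations of the readout are used, the sketch needs no backward pass. In this toy form the interior steps cancel inside each segment; how the per-step credits are used beyond the layer totals is not specified in the abstract and is not modeled here.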
