The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts

Abstract

The Linear Representation Hypothesis (LRH) identifies features of a trained deep network (DN) as linear directions in its activation spaces, i.e., the output spaces of intermediate layers. This characterization decouples the input-output maps learned by a DN from the organization of feature directions in its activation spaces. We introduce the Linear Centroids Hypothesis (LCH), which instead identifies features with linear directions in a DN's centroid spaces -- spaces in which each vector is a centroid, i.e., a summary of a local affine expert that characterizes the learned input-output maps of the DN exactly (e.g., for piecewise-affine DNs) or approximately (e.g., for smooth DNs like transformers). We show that replacing intermediate activations with centroids yields a functional drop-in alternative for standard interpretability tools. Empirically, this change yields sparser, more downstream-useful feature dictionaries on DINO ViTs, suppresses spurious directions on a controlled task, recovers interpretable circuits in GPT2-Large, and produces faithful gradient-based saliency maps. LCH unifies dictionaries, probing, circuits, and saliency maps into a single geometric object grounded in the network's input-output map -- making interpretability mechanistic by construction rather than post hoc. Code to study the LCH is available at https://github.com/ThomasWalker1/LinearCentroidsHypothesis.
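As a minimal sketch of what a "local affine expert" means for a piecewise-affine DN, the snippet below builds a tiny two-layer ReLU network in pure Python and recovers the affine map A x + c that the network applies exactly in the region around a given input. The network sizes, the random weights, and all function names are hypothetical and chosen only for illustration. The abstract does not define how a centroid is computed, so `summary_vector` uses one assumed stand-in (the column sums of A, which equal the gradient of the summed outputs with respect to the input); it should not be read as the paper's definition.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def forward(x, W1, b1, W2, b2):
    """Two-layer ReLU network: W2 * relu(W1 x + b1) + b2."""
    h = [max(0.0, p + b) for p, b in zip(matvec(W1, x), b1)]
    return [o + b for o, b in zip(matvec(W2, h), b2)]

def local_affine_expert(x, W1, b1, W2, b2):
    """Return (A, c) such that the network equals A x + c on x's ReLU region."""
    pre = [p + b for p, b in zip(matvec(W1, x), b1)]
    mask = [1.0 if p > 0 else 0.0 for p in pre]  # active ReLU units at x
    masked_W1 = [[m * w for w in row] for m, row in zip(mask, W1)]
    A = matmul(W2, masked_W1)  # A = W2 diag(mask) W1
    masked_b1 = [m * b for m, b in zip(mask, b1)]
    c = [o + b for o, b in zip(matvec(W2, masked_b1), b2)]
    return A, c

def summary_vector(A):
    """ASSUMED stand-in for a centroid: column sums of A (A^T 1)."""
    return [sum(col) for col in zip(*A)]

# Hypothetical 3 -> 4 -> 2 network with random weights.
rng = random.Random(0)
W1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
b1 = [rng.uniform(-1, 1) for _ in range(4)]
W2 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
b2 = [rng.uniform(-1, 1) for _ in range(2)]
x = [rng.uniform(-1, 1) for _ in range(3)]

y = forward(x, W1, b1, W2, b2)
A, c = local_affine_expert(x, W1, b1, W2, b2)
y_affine = [a + ci for a, ci in zip(matvec(A, x), c)]
mu = summary_vector(A)  # lives in the input space (length 3)
```

Here `y` and `y_affine` agree up to floating-point error, which is the sense in which the affine expert characterizes the input-output map exactly for a piecewise-affine network. For smooth networks such as transformers, the abstract indicates the characterization is only approximate.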
