Hierarchical Sparse Circuit Extraction from Billion-Parameter Language Models through Scalable Attribution Graph Decomposition
Abstract
Extracting sparse circuits from billion-parameter transformers is constrained by O(2^n) search cost and pervasive feature reuse across co-active pathways. Hierarchical Attribution Graph Decomposition (HAGD) addresses this through four stages: cross-layer transcoder training, spectral coarsening of attribution graphs, graph-neural-network (GNN)-guided hierarchical traversal, and causal intervention verification, reducing worst-case complexity to O(n^2 log n). Per-layer transcoders trained on the RedPajama corpus yield monosemantic dictionaries; gradient-activation products form weighted attribution graphs; normalized-Laplacian spectral clustering builds multi-resolution hierarchies; an attention-based GNN assigns circuit-membership scores at successive coarsening stages. Evaluation spans GPT-2 (117M-774M), Pythia (1.4B-6.9B), and Llama (7B-70B) across modular arithmetic, parity computation, integer sorting, coreference resolution (WinoGrande), commonsense reasoning (HellaSwag), and factual recall. Behavioral preservation reaches 91% (±2.3%) on modular arithmetic with circuits of 49-347 nodes, while ACDC exhausts memory beyond 1.4B parameters. Cross-architecture transfer coefficients span 0.38-0.82, with within-family pairs (Llama-7B → Llama-70B) attaining 0.82. Limitations include omitted attention-head circuits, 15-20% unexplained reconstruction variance, ablation-based validation circularity, and uncertain interpretability of circuits exceeding several hundred nodes.
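To make the attribution-graph and spectral-coarsening stages concrete, the following is a minimal toy sketch and not the paper's implementation. It assumes edge weights are absolute gradient-activation products, symmetrizes them, and performs a single two-way split using the sign of the second eigenvector of the normalized Laplacian, computed here by power iteration instead of a library eigensolver. All function names, the toy activations, and the toy gradients are hypothetical; the abstract does not specify the number of clusters, the coarsening schedule, or how directed edges are handled.

```python
import math


def attribution_graph(acts, grads):
    """Symmetric edge weights from gradient-activation products.

    acts[i]     : activation of feature i (toy values, hypothetical)
    grads[i][j] : gradient of feature j with respect to feature i
    """
    n = len(acts)
    a = [[abs(acts[i] * grads[i][j]) for j in range(n)] for i in range(n)]
    # Spectral clustering needs an undirected graph, so symmetrize.
    return [[0.5 * (a[i][j] + a[j][i]) for j in range(n)] for i in range(n)]


def fiedler_vector(w, iters=500):
    """Second eigenvector of L = I - D^{-1/2} W D^{-1/2} by power iteration.

    Assumes the graph is connected and every node has positive degree.
    """
    n = len(w)
    s = [math.sqrt(sum(row)) for row in w]       # square roots of degrees
    norm_s = math.sqrt(sum(v * v for v in s))
    u = [v / norm_s for v in s]                  # eigenvector of L with eigenvalue 0
    x = [float(i + 1) for i in range(n)]         # deterministic start vector
    for _ in range(iters):
        # Project out the trivial direction so the iteration finds the next one.
        dot = sum(xi * ui for xi, ui in zip(x, u))
        x = [xi - dot * ui for xi, ui in zip(x, u)]
        # Apply M = 2I - L = I + D^{-1/2} W D^{-1/2}; the top eigenvectors
        # of M are the bottom eigenvectors of L.
        y = [
            x[i] + sum(w[i][j] * x[j] / (s[i] * s[j]) for j in range(n))
            for i in range(n)
        ]
        norm_y = math.sqrt(sum(v * v for v in y))
        x = [v / norm_y for v in y]
    return x


def coarsen_once(w):
    """Assign each node to one of two supernodes by the sign of its entry."""
    return [0 if v < 0 else 1 for v in fiedler_vector(w)]


# Toy graph: two tightly coupled feature triples joined by one weak edge.
acts = [1.0, 2.0, 1.0, 1.0, 2.0, 1.0]
grads = [[0.0] * 6 for _ in range(6)]
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    grads[i][j] = 1.0
grads[2][3] = 0.05  # weak cross-cluster attribution

labels = coarsen_once(attribution_graph(acts, grads))
print(labels)
```

Which group receives label 0 and which receives label 1 depends on the sign of the eigenvector and carries no meaning; only the grouping matters. Applying `coarsen_once` recursively to each group would give one possible multi-resolution hierarchy, though whether HAGD builds its hierarchy this way is an assumption.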