Attribution Graphs and Causal Probing for Mechanistic Discovery and Bias Repair in Multimodal Generative Learning
Abstract
We treat the internals of generative models as mechanistic objects rather than black boxes. We introduce Attribution Graphs (AGs), which extend GradCAM++ to circuit-level representations, and Causal Probing, a do-calculus intervention method for identifying causal latent structures. Together they enable detection and correction of spurious correlations, demographic biases, and misaligned decision circuits during training. We further propose four components: the Cognitive Alignment Score (CAS), which quantifies agreement between model-internal representations and human concepts; a saliency-first privacy mechanism that shares only thresholded attribution nodes; a bias-aware regularizer that aligns subgroup statistics; and a Reveal-to-Revise loop that integrates attribution signals into parameter updates without separate fine-tuning. Evaluated on CelebA, FairFace, Jigsaw, and HateXplain, our method achieves 94.1% accuracy, 92.3% macro F1, 79.4% IoU-XAI, and 12.7 FID at 72-76% adversarial robustness, while reducing subgroup disparity Δbias by 41%. These results demonstrate that mechanistic interpretability, fairness, and generative performance can be jointly optimized.