Aligned explanations in neural networks
Abstract
As artificial intelligence increasingly drives critical decisions, the ability to explain how neural networks arrive at their predictions is essential for trust. Yet most current explanation methods offer post-hoc rationalizations and do not guarantee a true reflection of the model's reasoning. We introduce the notion of explanatory alignment, a requirement that explanations directly construct predictions rather than rationalize them. To achieve this in complex data domains, we present Pointwise-interpretable Networks (PiNets), a pseudo-linear architecture that forms linear models instance-wise. Evaluated on image classification and segmentation tasks, PiNets yield explanations that are faithful across four criteria: meaningfulness, alignment, robustness, and sufficiency (MARS). By reconciling the predictive power of deep learning with the interpretability of linear models, PiNets provide a principled foundation for trustworthy AI and data-driven scientific discovery.
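To make the idea of a pseudo-linear, instance-wise linear model concrete, the following is a minimal sketch and not the authors' architecture, which the abstract does not specify. It assumes a small two-layer network (the hypothetical `pointwise_weights`) that maps each input to its own weight vector, and then forms the prediction as the dot product of those weights with the input. All function names, layer sizes, and the `tanh` activation are illustrative assumptions. The point it shows is alignment by construction: the per-feature contributions are the explanation, and they sum exactly to the prediction.

```python
import numpy as np


def init_params(n_features, n_hidden, seed=0):
    """Randomly initialise a tiny weight-generating network (illustrative only)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(scale=0.5, size=(n_features, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(scale=0.5, size=(n_hidden, n_features)),
        "b2": np.zeros(n_features),
    }


def pointwise_weights(params, x):
    """Map each input row to its own weight vector w(x), one weight per feature."""
    hidden = np.tanh(x @ params["W1"] + params["b1"])
    return hidden @ params["W2"] + params["b2"]


def predict_with_explanation(params, x):
    """Return (score, contributions) where score = sum_i w_i(x) * x_i.

    The contributions are not computed after the fact: the score is
    defined as their sum, so the explanation constructs the prediction.
    """
    w = pointwise_weights(params, x)
    contributions = w * x                 # per-feature contribution
    score = contributions.sum(axis=-1)    # instance-wise linear prediction
    return score, contributions


if __name__ == "__main__":
    params = init_params(n_features=4, n_hidden=8)
    x = np.array([[1.0, -2.0, 0.5, 3.0]])
    score, contributions = predict_with_explanation(params, x)
    print("score:", score)
    print("contributions:", contributions)
```

In this sketch the model is linear in the input only for a fixed instance, since the weights themselves depend on `x`. That is one reading of "pseudo-linear" and is stated here as an assumption.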