A Mechanistic Investigation of Supervised Fine Tuning

Abstract

The cosine similarity between a large language model's hidden activations before and after Supervised Fine-Tuning (SFT) remains very high. This, at first glance, suggests that SFT leaves the model's activation geometry largely undisturbed. However, projecting both sets of activations through a Sparse Autoencoder (SAE) pretrained on the base model reveals that the underlying sparse latents diverge significantly. We introduce a novel investigative pipeline which utilizes these pretrained SAEs as a high-resolution diagnostic tool to mechanistically investigate the drivers of this representational divergence. Through our analytical pipeline, we discover task-specific and layer-specific distributions of the precise semantic features that are systematically altered during supervised fine-tuning. We additionally identify a layer-wise update profile specific to safety alignment. All code, experimental scripts, and analysis files associated with this work are publicly available at: https://github.com/ruhzi/sae-investigation.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…