Endogenous Resistance to Activation Steering in Language Models

Abstract

Large language models can recover mid-generation from task-misaligned activation steering, producing explicit verbal restarts (e.g., ``wait, that's not right'') and continuing on-topic even while the steering perturbation remains active. We term this Endogenous Steering Resistance (ESR). Using sparse autoencoder (SAE) latents to steer model activations, we find that Llama-3.3-70B exhibits explicit ESR at \%, with smaller models from the Llama-3 and Gemma-2 families showing the explicit form less frequently. Two controls dissociate ESR into a detection event and a sustained-resistance component that conditioning on recent on-topic tokens does not fully explain. We identify SAE latents through contrastive on-topic/off-topic search; zero-ablating them reduces the multi-attempt rate by μltiAttemptReductionPct\%, with random-latent and held-out-prompt controls supporting specificity. ESR can also be deliberately enhanced through both meta-prompting and fine-tuning on synthetic self-correction examples. ESR has dual implications for safety: it could harden models against adversarial activation-space manipulation, but may equally interfere with beneficial steering-based interventions, since the model has no way to distinguish the two. Code is available at https://github.com/agencyenterprise/endogenous-steering-resistancegithub.com/agencyenterprise/endogenous-steering-resistance.

0

Turn this paper into a lesson

ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…