When Should We Introduce Safety Interventions During Pretraining?

Abstract

Prior work has shown that safety interventions applied during pretraining, such as removing and rephrasing harmful content, can substantially improve the robustness of the resulting models. In this paper, we study the fundamental question that prior work has overlooked: "When during pretraining should safety interventions be introduced?" We keep the underlying data sources and pretraining interventions fixed, varying the intervention start time (after 0%, 20%, or 60% of pretraining tokens). We find that the optimal start time is not one-size-fits-all: with standard top-k decoding, introducing interventions after a short initial phase of safe-only pretraining (20%-60%) often yields the strongest robustness, with the clearest benefits emerging after downstream, benign finetuning. In contrast, for safety-aware inference, interventions starting from the beginning improve steerability towards safer generations. Finally, we observe that earlier interventions reshape internal representations: linear probes more cleanly separate safe vs harmful examples. Our results are the first to establish intervention timing as a key curriculum design choice for safety.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…