Auditing CoT Answer-Hijack Patches: Source-Control Certificates with Type-I Guarantees

Abstract

Chain-of-thought (CoT) answer-hijack templates can flip the final numeric answer of a 7B-8B language model on GSM8K or MATH-500 even when the visible reasoning trace looks fluent. Activation patching is the standard probe for locating where this hijack can be undone, and a successful clean-source patch is often read as evidence that the patched activation carries the recovered content. We show that this reading is unsound: clean-only localization profiles (peak, spread, thresholded band) underidentify the frozen-hook source contrast, and the clean-only profile is an intervention map, not a mediation certificate. We then construct an audit that turns each candidate patch into a source-control certificate with a pre-registered Type-I guarantee. The certificate runs in three stages: SELECT (clean-source band sweep with permutation calibration and held-out validation), FREEZE (lock the hook), and AUDIT (paired-bootstrap source contrasts at the frozen hook). It emits an incorrect mechanism label with probability at most alpha = alphasel + alphaaudit under sample-split disjointness. A matching-rate sample-complexity theorem (nstar = Theta(Delta-2 log(1/alpha))) bounds the audit cost. On Qwen2.5-7B and Llama3-8B, three few-shot/puzzle cells pass confirmatory K=1 localization with held-out gaps +32.6, +45.1, +17.7; fixed-hook reruns recover 47.0% (Qwen-puzzle) and 39.0% (Llama3-puzzle) at n=100; frozen MATH-500 transfer recovers 26.0%. After audit, Llama3-PZ and Qwen-PZ are identity-light with moderate magnitude (Qwen-PZ also layer-sensitive); Llama3-FS is a single-seed moderate-positive candidate (multi-seed replication queued); Qwen-FS is exploratory non-separation with a layer-sensitive flag. The method is a diagnostic auditing protocol, not an adaptive safety defense.

0

Turn this paper into a lesson

ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…