Pessimism's Paradox: Conservative Offline Training Amplifies Reward Hacking During Online Adaptation in Reasoning Models

Abstract

Conservative offline training is widely advocated as a safe foundation for subsequent online adaptation: if a policy stays close to well-supported behaviour, the argument goes, it is less likely to exploit imperfections in a learned reward model. We challenge this intuition empirically and mechanistically. We train a Qwen3-14B policy under Direct Preference Optimisation (DPO) with three levels of conservatism (β ∈ {βlo, βmid, βhi}, derived from empirical log-ratio percentiles), then adapt each checkpoint online against a learned reward ensemble (3 × Qwen3-1.7B) while measuring true performance as GSM8K exact-answer accuracy. We find that higher offline conservatism monotonically increases reward-hacking damage, measured by the Goodhart gap and its area under the curve (AUGC), with Spearman ρ = 1.0 across all three conditions. Mechanistic analysis reveals a three-link causal chain: (i) high-β DPO compresses policy entropy; (ii) low-entropy policies generate less diverse responses that concentrate in a narrow region of the reward model's training distribution (lower pairwise cosine distance); and (iii) despite this proximity, ensemble disagreement (epistemic uncertainty) increases with β and is exploited faster during online optimisation. We further fit a power-law curve to the (β, ) data and identify a practical optimal conservatism level β that balances alignment fidelity against hacking vulnerability. Our results suggest that the field needs calibrated, not maximal, conservatism.
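The abstract names the Goodhart gap, AUGC, and a Spearman rank correlation without giving formulas, so the following is only a minimal sketch under stated assumptions: the gap is taken to be proxy reward minus true score (both assumed to be on a common [0, 1] scale), AUGC is taken to be the trapezoidal area under that gap over online training steps, and the β values and curves are made-up placeholders, not results from the paper. All function names are hypothetical.

```python
from typing import List, Sequence


def goodhart_gap(proxy: Sequence[float], true: Sequence[float]) -> List[float]:
    """Assumed definition: proxy reward minus true score at each checkpoint."""
    return [p - t for p, t in zip(proxy, true)]


def augc(steps: Sequence[float], gap: Sequence[float]) -> float:
    """Assumed definition: trapezoidal area under the gap curve over steps."""
    area = 0.0
    for i in range(1, len(steps)):
        area += 0.5 * (gap[i] + gap[i - 1]) * (steps[i] - steps[i - 1])
    return area


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rho: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Placeholder data (NOT from the paper): three hypothetical beta values,
# each with a proxy-reward curve and a true-accuracy curve over three steps.
steps = [0, 100, 200]
runs = {
    0.05: ([0.50, 0.60, 0.70], [0.50, 0.55, 0.60]),
    0.10: ([0.50, 0.70, 0.80], [0.50, 0.55, 0.50]),
    0.50: ([0.50, 0.80, 0.95], [0.50, 0.50, 0.40]),
}

betas = sorted(runs)
augcs = [augc(steps, goodhart_gap(*runs[b])) for b in betas]
rho = spearman(betas, augcs)
print(betas, augcs, rho)
```

With only three conditions, ρ can take just a handful of values, and ρ = 1.0 means exactly that the AUGC ordering matches the β ordering; it says nothing about the size of the differences between conditions.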
