Calibration Collapse Under Sycophancy Fine-Tuning: How Reward Hacking Breaks Uncertainty Quantification in LLMs

Abstract

Modern large language models (LLMs) are increasingly fine-tuned via reinforcement learning from human feedback (RLHF) or related reward optimisation schemes. While such procedures improve perceived helpfulness, we investigate whether sycophantic reward signals degrade calibration, a property essential for reliable uncertainty quantification. We compare Qwen3-8B under three regimes: no fine-tuning (base), neutral supervised fine-tuning (SFT) on TriviaQA, and sycophancy-inducing Group Relative Policy Optimisation (GRPO) that rewards agreement with planted wrong answers. Evaluating on 1,000 MMLU items across five subject domains with bootstrap confidence intervals and permutation testing, we find that sycophantic GRPO produces consistent directional calibration degradation: expected calibration error (ECE) rises by 0.006 relative to the base model and maximum calibration error (MCE) rises by 0.010 relative to neutral SFT, though the effect does not reach statistical significance (p = 0.41) at this training budget. Post-hoc matrix scaling applied to all three models reduces ECE by 40–64% and improves accuracy by 1.5–3.0 percentage points. However, the sycophantic model retains the highest post-scaling ECE, above that of the neutral SFT control (0.042 vs. 0.037), suggesting that reward-induced miscalibration leaves a structured residual even after affine correction. These findings establish a methodology for evaluating the calibration impact of reward hacking and motivate calibration-aware training objectives.
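
For readers unfamiliar with the two calibration metrics reported above, the following is a minimal sketch of how ECE and MCE are commonly computed from per-item confidences and correctness labels. The abstract does not state the binning scheme, so the 10 equal-width bins used here are an assumption, and the function name `calibration_errors` is hypothetical rather than taken from the paper.

```python
import numpy as np


def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, MCE) using equal-width confidence bins.

    Assumption: 10 equal-width bins on [0, 1]; the paper's own
    binning choice is not given in the abstract.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)

    # Interior bin edges; digitize maps each confidence to a bin index
    # in 0..n_bins-1, with bins closed on the right.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        # Gap between empirical accuracy and mean confidence in this bin.
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap  # weight by the fraction of items in the bin
        mce = max(mce, gap)       # worst-case bin
    return ece, mce
```

As a quick check, four predictions made at confidence 0.9 with three of them correct fall into a single bin with accuracy 0.75, so `calibration_errors([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])` gives an ECE and MCE of about 0.15 each. ECE averages the per-bin gaps weighted by bin occupancy, while MCE reports only the largest gap, which is why the two can move differently across training regimes.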
