Cheap Reward Hacking Detection
Abstract
A small transformer encoder is trained to map Terminal-Wrench (TW) trajectories onto a unit sphere where embedding distance approximates the L1 distance between reward and metadata signals. A linear probe on that embedding detects reward hacking on the cleaned test split with AUC 0.9467 and TPR@5%FPR 0.8296. Under the same information condition, this is close to the AUC of the TW sanitized LLM-as-judge (0.9510 on the cleaned split) and exceeds its TPR@5%FPR (0.8296 vs. 0.7130), at roughly four orders of magnitude lower per-trajectory cost. The encoder is not a pure behavior reader: stripping natural-language reasoning from its input at probe time drops AUC to 0.6213.
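The two headline metrics, AUC and TPR@5%FPR, can be illustrated with a minimal pure-Python sketch. This is not the paper's evaluation code: the function names `roc_auc` and `tpr_at_fpr`, the toy labels and scores, and the convention that a trajectory is flagged when its score is at or above the threshold are all assumptions made for illustration.

```python
from typing import Sequence


def _split(labels: Sequence[int], scores: Sequence[float]):
    """Separate scores into positives (label 1) and negatives (label 0)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    return pos, neg


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half (Mann-Whitney formulation)."""
    pos, neg = _split(labels, scores)
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def tpr_at_fpr(labels: Sequence[int], scores: Sequence[float],
               max_fpr: float = 0.05) -> float:
    """Highest true-positive rate over thresholds whose false-positive rate
    does not exceed max_fpr. Assumed convention: flag when score >= threshold."""
    pos, neg = _split(labels, scores)
    best = 0.0
    for t in sorted(set(scores)):
        fpr = sum(n >= t for n in neg) / len(neg)
        if fpr <= max_fpr:
            tpr = sum(p >= t for p in pos) / len(pos)
            best = max(best, tpr)
    return best


# Toy data (hypothetical): 1 = reward hacking, 0 = benign trajectory.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]

auc = roc_auc(labels, scores)
tpr5 = tpr_at_fpr(labels, scores, max_fpr=0.05)
print(f"AUC={auc:.4f}  TPR@5%FPR={tpr5:.4f}")
```

TPR@5%FPR is reported alongside AUC because it measures detection in the low-false-alarm regime, which is why two detectors with similar AUC can differ substantially on it.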