How Neural Reward Models Learn Features for Policy Optimization: A Single-Index Analysis
Abstract
Reward modeling is not only a prediction problem: in KL-regularized policy optimization, the learned reward is exponentiated to define the deployed policy, so downstream value depends on errors in reward-tilted regions. We study this feedback in a Gaussian single-index model with $r^*(x) = \sigma^*(\langle \theta^*, x \rangle)$ and $x \sim \mathcal{N}(0, I_d)$. We analyze a two-stage neural reward model that first learns the hidden direction $\theta^*$ from reward-weighted samples and then fits the readout layer by weighted ridge regression. Exponential reward weighting changes the Hermite signal available to the first layer; for any feature-learning temperature $\beta_1$ above a dimension-free $O(1)$ threshold, a constant fraction of neurons recover the hidden direction, with weak-recovery complexity governed by the generative exponent. After feature recovery, we derive tilted-policy value-gap bounds for an idealized label-weighted fit with weights $e^{y/\beta_2}$ and a more practical surrogate-weighted fit with weights $e^{r_{a_0}(x)/\beta_2}$. Keeping the $\beta_2$-dependence explicit yields an admissible set of deployment temperatures, balancing the gain from lowering $\beta_2$ against the learning cost amplified by exponential weighting; in the surrogate-weighted case, proxy-dependent factors shrink this admissible set.
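As a rough illustration of the second stage only, the following sketch fits a readout layer by weighted ridge regression with label-dependent exponential weights. It is not the paper's procedure: the tanh link, the ReLU features with random biases, the assumption that the hidden direction is already known exactly, the mean-one weight normalisation, and all names and parameter values (`weighted_ridge`, `exp_weights`, `lam`, `beta2`) are hypothetical choices made here for concreteness.

```python
import numpy as np

def weighted_ridge(features, y, weights, lam):
    """Solve min_a sum_i w_i * (y_i - features_i @ a)**2 + lam * ||a||**2."""
    wf = features * weights[:, None]                  # rows scaled by their weights
    gram = features.T @ wf + lam * np.eye(features.shape[1])
    return np.linalg.solve(gram, wf.T @ y)

def exp_weights(scores, beta2):
    """Exponential tilting weights exp(score / beta2), rescaled to mean 1.

    Subtracting the maximum before exponentiating only changes a common
    factor, which the mean-one rescaling removes (assumed normalisation).
    """
    w = np.exp((scores - scores.max()) / beta2)
    return w / w.mean()

# Synthetic Gaussian single-index data (all choices below are assumptions).
rng = np.random.default_rng(0)
d, n, width = 8, 2000, 16
theta = np.zeros(d)
theta[0] = 1.0                                        # hypothetical hidden direction
x = rng.standard_normal((n, d))                       # x ~ N(0, I_d)
y = np.tanh(x @ theta)                                # hypothetical link function

# Pretend stage one succeeded: every neuron is aligned with theta.
biases = rng.standard_normal(width)
features = np.maximum(x @ theta[:, None] + biases, 0.0)

# Label-weighted readout fit at deployment temperature beta2.
w = exp_weights(y, beta2=0.5)
a = weighted_ridge(features, y, w, lam=1e-3)
```

Lowering `beta2` concentrates the weights on high-reward samples, so fewer samples effectively contribute to the fit; this is one informal way to see the learning cost that the abstract says is amplified by exponential weighting.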