Binary Rewards and Reinforcement Learning: Fundamental Challenges

Marc Dymetman

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for improving reasoning in language models, yet models trained with RLVR often suffer from diversity collapse: while single-sample accuracy improves, multi-sample coverage degrades, sometimes falling below the base model. We provide a structural account of this phenomenon grounded in the properties of binary rewards. Binary rewards create a fundamental degeneracy for policy gradient methods: the set of distributions maximizing expected reward is infinite, with no distinguished element. KL-control resolves this degeneracy by selecting, in the limit $\beta \to 0$, the filtered model $p^* := a(\cdot \mid 1)$ -- the base model conditioned on validity -- which is the unique fully valid distribution closest to the base model in KL divergence. This selection operates through a nontrivial asymmetry: the tilted distribution $p_{[\beta]}(y) \propto a(y)\, e^{v(y)/\beta}$ converges to $p^*$ in forward KL as $\beta \to 0$, yet $p^*$ cannot serve as a direct optimization target because $\mathrm{KL}(q \,\|\, p^*)$ is infinite for any full-support policy $q$. We develop explicit formulas relating the hyperparameter $\beta$ to the more interpretable target validity rate $\mu$. Under model misspecification -- the typical practical regime -- the pressure to decrease $\beta$ drives the optimizer toward highly concentrated distributions over a small number of valid outputs, collapsing toward ever fewer as $\beta$ decreases, rather than toward the filtered model. We illustrate this mechanism on a toy autoregressive experiment and discuss how alternative divergences that target $p^*$ directly -- as pursued empirically by kruszewskiwhatever2026 -- avoid this failure mode by rewarding coverage of $p^*$'s support rather than concentration on high-validity outputs.
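
The relationship between the tilted distribution, the filtered model, and the validity rate can be checked numerically on a small discrete example. The sketch below is an illustration under stated assumptions, not the paper's implementation: the four-outcome base distribution and all function names (`tilted`, `filtered`, `validity_rate`, `beta_for_target`, `forward_kl`) are hypothetical. The inversion in `beta_for_target` is derived here directly from the tilted form with a binary reward v(y) in {0, 1}, writing alpha for the base model's validity rate; the paper's own formulas may be stated differently.

```python
import math


def tilted(base, valid, beta):
    """Tilted distribution p(y) proportional to a(y) * exp(v(y) / beta), binary v."""
    # Dividing every weight by exp(1/beta) avoids overflow for small beta:
    # valid outputs keep weight a(y), invalid ones get a(y) * exp(-1/beta).
    penalty = math.exp(-1.0 / beta)
    weights = [a * (1.0 if v else penalty) for a, v in zip(base, valid)]
    z = sum(weights)
    return [w / z for w in weights]


def filtered(base, valid):
    """Filtered model p*: the base model conditioned on validity."""
    alpha = sum(a for a, v in zip(base, valid) if v)
    return [a / alpha if v else 0.0 for a, v in zip(base, valid)]


def validity_rate(p, valid):
    """Probability mass that p assigns to valid outputs."""
    return sum(pi for pi, v in zip(p, valid) if v)


def beta_for_target(alpha, mu):
    """Beta whose tilted distribution has validity rate mu (needs alpha < mu < 1).

    Solves mu = alpha * e^(1/beta) / (alpha * e^(1/beta) + 1 - alpha) for beta.
    """
    if not (0.0 < alpha < mu < 1.0):
        raise ValueError("require 0 < alpha < mu < 1")
    return 1.0 / math.log(mu * (1.0 - alpha) / (alpha * (1.0 - mu)))


def forward_kl(p_star, p):
    """KL(p* || p), summing only over the support of p*."""
    return sum(ps * math.log(ps / pi) for ps, pi in zip(p_star, p) if ps > 0.0)


if __name__ == "__main__":
    # Hypothetical toy base model over four outputs; outputs 0 and 2 are valid.
    base = [0.1, 0.2, 0.3, 0.4]
    valid = [True, False, True, False]
    alpha = validity_rate(base, valid)
    p_star = filtered(base, valid)
    for mu in (0.5, 0.9, 0.99, 0.999):
        beta = beta_for_target(alpha, mu)
        p = tilted(base, valid, beta)
        print(mu, beta, validity_rate(p, valid), forward_kl(p_star, p))
```

In this exact (well-specified) setting the forward divergence KL(p* || p) equals -log(mu), so it shrinks to zero as the target validity rate approaches 1, while the reverse divergence KL(p || p*) stays infinite for every finite beta because the tilted distribution keeps some mass on invalid outputs. The example does not model the misspecified regime discussed in the abstract, where the optimizer cannot represent the tilted distribution exactly.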
