Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers

Abstract

Prevalent reinforcement learning~(RL) methods for fine-tuning LLM reasoners, such as GRPO or Leave-one-out PPO, abandon the learned value function in favor of empirically estimated returns. This hinders test-time compute scaling that relies on using the value-function for verification. Yet if parallel test-time compute is already part of the deployment plan, training should be designed to support it. In this work, we propose RLV that augments any ``value-free'' RL method by jointly training the LLM as both a reasoner and a generative verifier using RL-generated data, adding verification capabilities without significant overhead. Empirically, RLV boosts MATH accuracy by over 20\% with parallel sampling and enables 8-32× efficient test-time compute scaling compared to the base RL method. RLV also exhibits strong generalization capabilities for both easy-to-hard and out-of-domain tasks. Furthermore, RLV achieves 1.2-1.6× higher performance when jointly scaling parallel and sequential test-time compute with a long reasoning R1 model. More broadly, RLV instantiates the principle of co-training for test-time scaling: jointly optimizing for task performance and a capability useful at inference, using data that RL training already produces.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…