A Relative-Budget Theory for Reinforcement Learning with Verifiable Rewards in Large Language Model Reasoning

Abstract

Reinforcement learning (RL) is a dominant paradigm for improving the reasoning abilities of large language models, yet its effectiveness varies across tasks and compute budgets. We propose a relative-budget theory explaining this variation through a single quantity called relative budget ξ := H/E[T], where H is the generation horizon (token budget) and T denotes the number of tokens until the first correct solution under a base policy. We show that ξ determines sample efficiency by controlling reward variance and the likelihood of informative trajectories. Our analysis reveals three regimes: in the deficient regime (ξ → 0), informative trajectories are rare and the sample complexity explodes; in the balanced regime (ξ = Θ(1)), informative trajectories occur with non-negligible probability and RL is maximally sample-efficient; and in the ample regime (ξ → ∞), learning remains stable but marginal gains per iteration diminish. We further provide finite-sample guarantees for online RL that characterize learning progress across these regimes. Specifically, in a case study under idealized distributional assumptions, we show that the relative budget grows linearly over iterations. Our empirical results confirm these predictions in realistic settings, identifying a budget ξ ∈ [1.5, 2.0] that maximizes learning efficiency and coincides with peak reasoning performance.
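
To make the three regimes concrete, the following is a minimal numerical sketch, not the paper's analysis. It assumes, purely for illustration, that T is geometrically distributed with a per-token success probability p (so E[T] = 1/p) and that the verifiable reward is binary: 1 if a correct solution appears within the horizon H, else 0. The function names and the geometric model are hypothetical choices made here; the abstract does not specify a distribution for T. The relative budget H/E[T] is written `xi` in the code.

```python
def relative_budget(horizon: int, expected_tokens: float) -> float:
    """Relative budget H / E[T] as defined in the abstract."""
    return horizon / expected_tokens


def success_probability(horizon: int, p: float) -> float:
    """P(T <= H) under the ASSUMED model T ~ Geometric(p) on {1, 2, ...}."""
    return 1.0 - (1.0 - p) ** horizon


def reward_variance(q: float) -> float:
    """Variance of a binary (Bernoulli) reward with success probability q."""
    return q * (1.0 - q)


def summarize(horizon: int, p: float) -> tuple[float, float, float]:
    """Return (relative budget, success probability, reward variance)."""
    expected_tokens = 1.0 / p  # mean of the assumed geometric distribution
    xi = relative_budget(horizon, expected_tokens)
    q = success_probability(horizon, p)
    return xi, q, reward_variance(q)


if __name__ == "__main__":
    p = 0.001  # hypothetical per-token success probability, so E[T] = 1000
    for horizon in (10, 500, 1000, 2000, 10000):
        xi, q, var = summarize(horizon, p)
        print(f"H={horizon:>6}  xi={xi:6.2f}  P(success)={q:.4f}  Var(reward)={var:.4f}")
```

Under this toy model, a very small relative budget makes successful trajectories rare and the reward variance close to zero, a very large one makes nearly every trajectory succeed so the variance again vanishes, and budgets of order one keep both outcomes reasonably likely. This is only meant to mirror the qualitative picture in the abstract; the specific optimum reported there (a budget in [1.5, 2.0]) comes from the authors' experiments, not from this sketch.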
