Beyond the Sampled Token: Preserving Candidate Support in RLVR
Abstract
We revisit exploration collapse in reinforcement learning with verifiable rewards (RLVR) from the perspective of the candidate distribution for next-token prediction. We formally show that as probability concentrates on the top-1 candidate, the expected number of distinct responses collapses to one regardless of the sampling budget K. This theoretical implication is further verified by our empirical tracking of top-N candidate probabilities during training, where the top-1 candidate progressively dominates while plausible alternatives are suppressed. These findings suggest a key desideratum for effective exploration: preserving non-negligible probability mass on the top-N candidates. To this end, we propose Candidate-aware Support Preservation (CaSP), with two complementary designs. Specifically, CaSP redistributes positive gradients among top-N candidates for correct responses, and applies a stronger penalty to the top-1 candidate for incorrect responses. Unlike many exploration-oriented methods that improve pass@K at the cost of pass@1, CaSP improves pass@K across the full K spectrum. These gains generalize to 6 math, 2 logical-reasoning, and 2 coding benchmarks, and scale to 32B-parameter models and sampling budgets up to K=1024, positioning CaSP as a principled, candidate-level approach for RLVR exploration.
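The collapse claim can be illustrated with a minimal sketch. It is a simplified, single-draw analogue and not the paper's formal result: it assumes each response is one i.i.d. draw from a fixed categorical distribution, in which case the expected number of distinct outcomes in K draws is the sum over candidates of 1 - (1 - p_i)^K. The function names `expected_distinct` and `peaked`, and the choice of five candidates, are hypothetical and chosen only for illustration.

```python
def expected_distinct(probs, k):
    """Expected number of distinct outcomes in k i.i.d. draws from `probs`.

    Candidate i appears at least once with probability 1 - (1 - p_i)**k,
    so the expectation is the sum of these terms (linearity of expectation).
    """
    return sum(1.0 - (1.0 - p) ** k for p in probs)


def peaked(top1, n_candidates):
    """Illustrative distribution: the top-1 candidate gets mass `top1`,
    and the remaining mass is spread uniformly over the other candidates."""
    rest = (1.0 - top1) / (n_candidates - 1)
    return [top1] + [rest] * (n_candidates - 1)


if __name__ == "__main__":
    # As top-1 mass approaches 1, the expected count approaches 1,
    # even with a large sampling budget K.
    for top1 in (0.5, 0.9, 0.999, 0.999999):
        for k in (8, 1024):
            value = expected_distinct(peaked(top1, 5), k)
            print(f"top1={top1:<9} K={k:<5} E[distinct]={value:.4f}")
```

Under these assumptions, a larger K only delays the collapse: once the mass outside the top-1 candidate is small relative to 1/K, additional samples almost never surface an alternative.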