Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning

Abstract

As Reinforcement Learning with Verifiable Rewards (RLVR) substantially improves the reasoning abilities of large language models (LLMs), a new bottleneck emerges: more training problems become saturated, that is, the LLM answers the questions correctly for nearly every rollout. On such problems, rewards provide little useful learning signal. While collecting harder problems is a natural response, it is costly and increasingly difficult. We propose failure-prefix conditioning, a simple method that unlocks the remaining signal in saturated problems by shifting exploration toward failure-prone reasoning states. By conditioning on prefixes of rare incorrect trajectories, the method improves the model's ability to recover from misleading early reasoning. We observe that failure-prefix conditioning consistently improves performance where standard RLVR stalls, and achieves gains comparable to training on newly collected medium-difficulty problems. We further analyze the model's robustness, finding that our method reduces performance degradation under misleading failure prefixes, albeit with a mild trade-off in adherence to correct early reasoning. Finally, we demonstrate that an iterative approach, which refreshes failure prefixes during training, unlocks additional gains after performance plateaus. Overall, our results show that saturated problems still contain valuable learning signal, and that failure-prefix conditioning provides an effective way to unlock it.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…