Cast a Wider Net: Coordinated Pass@K Policy Optimization for Code Reasoning
Abstract
Repeated sampling with a verifier is the standard way to allocate test-time compute for code generation, with pass@K as the canonical metric. Yet the standard policy class draws K independent samples from a single answer distribution, so attempts often collapse onto near-duplicate reasoning paths and waste the budget on redundant rollouts. This failure is costly in competitive programming, where many problems admit multiple distinct algorithmic strategies and pass@K requires only one correct attempt. We propose Coordinated Pass@K Policy Optimization (CPPO), which turns pass@K generation into joint exploration over strategies: a planner emits a tuple of K=4 alternative high-level methods, and a shared solver attempts one solution per method. CPPO trains this joint policy with a multiplicative planner reward, R_plan = J_ψ · R_out, assigning credit only to valid strategy tuples that lead to verifier-confirmed pass@K success. Across APPS, CodeContests, and LiveCodeBench-v6, CPPO improves pass@4 over direct sampling, planning baselines, planner-only SFT, and pass@K-oriented RL under the same K=4 solver-attempt budget, with statistically significant gains on six of nine model-benchmark cells. The largest single gain is +0.16 on Qwen3.5-9B LiveCodeBench-v6 over the strongest baseline, PKPO (0.588 → 0.748; paired bootstrap, p < 0.05).
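As a reading aid, the multiplicative planner reward can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not define J_ψ or R_out precisely, so the sketch assumes J_ψ is a validity score in [0, 1] for the strategy tuple and R_out is a binary pass@K outcome (1 if at least one of the K solver attempts passes the verifier). The function names are hypothetical.

```python
from typing import Sequence


def pass_at_k_outcome(attempt_passed: Sequence[bool]) -> float:
    """Assumed R_out: 1.0 if at least one of the K attempts passed the verifier."""
    return 1.0 if any(attempt_passed) else 0.0


def planner_reward(tuple_validity: float, attempt_passed: Sequence[bool]) -> float:
    """Assumed R_plan = J_psi * R_out.

    tuple_validity stands in for J_psi, taken here to be a score in [0, 1]
    judging whether the planner's strategy tuple is valid. The product is
    nonzero only when the tuple is valid AND some attempt succeeded.
    """
    if not 0.0 <= tuple_validity <= 1.0:
        raise ValueError("tuple_validity is assumed to lie in [0, 1]")
    return tuple_validity * pass_at_k_outcome(attempt_passed)


# K = 4 attempts, one per planned strategy (illustrative verifier results).
valid_and_solved = planner_reward(1.0, [False, True, False, False])
invalid_but_solved = planner_reward(0.0, [True, True, True, True])
valid_but_unsolved = planner_reward(1.0, [False, False, False, False])
```

Under these assumptions, only the first case earns planner credit. An invalid tuple receives nothing even when the solver succeeds, and a valid tuple receives nothing when every attempt fails, which matches the abstract's description of credit going only to valid tuples that lead to verifier-confirmed success.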