Reinforcement Learning-Guided Chain-of-Draft for Token-Efficient Code Generation
Abstract
LLMs demonstrate surface-level fluency in code generation but struggle with structured reasoning tasks requiring correctness and semantic alignment. While Chain-of-Thought (CoT) prompting enhances reasoning through intermediate steps, it suffers from verbosity and inefficiency. Chain-of-Draft (CoD) prompting offers more concise reasoning, but the stochastic nature of LLMs produces varying solution quality, making optimal selection challenging. We propose μlticod, a reinforcement learning framework that learns to select the most promising candidate from CoD-generated solutions. Our approach uses strategy-guided prompting to encourage diverse reasoning styles and models solution selection as a contextual bandit problem. The framework optimizes interpretable features including code complexity, reasoning structure, and strategic metadata through a reward function balancing correctness, efficiency, and clarity. Experiments on MBPP, BigCodeBench, SWE-bench Verified, and Defects4J show μlticod~outperforms and in some cases, on par with standard prompting, CoT, and CoD baselines while achieving cost and token efficiency from the user's perspective through a multi-candidate design that charges only for the selected output, reducing user billing by over 50\% and improving LLM response quality, making μlticod~more sustainable and scalable for real-world deployment. Our code is available: https://anonymous.4open.science/r/MultiCoD.
Turn this paper into a full lesson
ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.