Denoising Iterative Self-Correction: Structured Verification Loops for Reliable LLM Reasoning

Abstract

Large language models produce fluent but often incorrect multi-step reasoning, and naive correction methods risk degrading already-correct answers. We introduce Denoising Iterative Self-Correction (DISC), a test-time procedure that treats verification question outputs as noisy measurements of where a solution may be corrupted. Using these signals, DISC progressively reduces errors across multiple verify-judge-correct passes, analogous to traditional iterative denoising. A binary judgment gate controls correction precision by blocking rewrites that would damage already-correct answers while the verifier and corrector together repair errors. We evaluate this trade-off using two paired diagnostics: an improvement-to-degradation ratio (precision) and a repair rate (recall). Across three benchmarks (BIG-Bench Mistake, HotpotQA, GPQA Diamond) and four models, DISC dominates Chain-of-Verification and Self-Refine on the precision-recall trade-off, reaching 81.6% accuracy with 13x more improvements per degradation than Chain-of-Verification and 5x more than Self-Refine on BIG-Bench Mistake (Sonnet~4.5). On GPQA Diamond, we identify a capability floor below which judges acknowledge contradictions in evidence but cannot translate that recognition into a correction. We further show that cross-model role allocation -- assigning verification and judgment to a model different from the generator -- mitigates self-confirmation bias.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…