StepGap: A Hybrid NLI-LLM Checker for Step-Level Evidence-Gap Detection in Multi-Hop Question Answering
Abstract
We present StepGap, a hybrid NLI-LLM decision tree that detects step-level evidence gaps in multi-hop QA and emits one of three typed labels: Contradicted Claim (CC), Irrelevant Evidence (IE), or Missing Bridge (MB), each tied to a concrete repair action. On 82 multi-hop questions (181 annotated steps, κ=0.704), StepGap reaches sF1=72.0, within the bootstrap confidence interval of an LLM-only baseline (70.1) but with a more decomposable structure: every StepGap stage hurts F1 when removed, while three of four LLM-only removals improve F1 -- a sign of competing-error cancellation, where internal stages mask each other's errors. We further expose a Q-F1 trap: question-level F1 is mechanically inflated by checkers that flag every step, making step-level F1 the necessary diagnostic. Used as a typed GRPO process reward, StepGap improves Qwen2.5-7B-Instruct Exact Match from 32.1±0.3 to 35.4±0.9 across three seeds, with the single-run comparison showing a +5.6 Avg EM gain over the matched Search-R1 GRPO reproduction.
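The Q-F1 trap described in the abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration, not the paper's evaluation code: the toy labels are invented, gap detection is reduced to a binary flag (the paper uses three typed labels), and a question is assumed to count as flagged when any of its steps is flagged. Under those assumptions, a degenerate checker that flags every step scores much higher at the question level than at the step level.

```python
from typing import List

def f1(gold: List[bool], pred: List[bool]) -> float:
    """Binary F1 over aligned gold/predicted flags."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy data: one list per question, True = step has an evidence gap.
gold_questions = [
    [True, False, False],
    [False, True, False],
    [False, False, True],
    [False, False, False],
]

def flag_everything(questions: List[List[bool]]) -> List[List[bool]]:
    """Degenerate checker: marks every step as a gap."""
    return [[True] * len(q) for q in questions]

def step_f1(gold_qs: List[List[bool]], pred_qs: List[List[bool]]) -> float:
    """F1 computed over individual steps, pooled across questions."""
    gold = [s for q in gold_qs for s in q]
    pred = [s for q in pred_qs for s in q]
    return f1(gold, pred)

def question_f1(gold_qs: List[List[bool]], pred_qs: List[List[bool]]) -> float:
    """F1 over questions; assumed rule: a question is positive if any step is."""
    gold = [any(q) for q in gold_qs]
    pred = [any(q) for q in pred_qs]
    return f1(gold, pred)

pred_questions = flag_everything(gold_questions)
s_f1 = step_f1(gold_questions, pred_questions)
q_f1 = question_f1(gold_questions, pred_questions)
print(f"step-level F1: {s_f1:.3f}, question-level F1: {q_f1:.3f}")
```

In this toy setting most questions contain at least one gap, so flagging everything is nearly always right at the question level while being wrong on most individual steps; that is the sense in which step-level F1 is the more discriminating diagnostic.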