Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations
Abstract
LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: (1) a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates (0.8-4.1%), with 33-67% of documents exhibiting at least one directed 3-cycle; and (2) split conformal prediction sets over 1-5 Likert scores providing theoretically guaranteed ≥(1-α) coverage, with set width serving as a per-instance reliability indicator (r_s = +0.576, N = 1,918, p < 10^-100, pooled across all judges). Critically, prediction set width shows consistent cross-judge agreement (r = 0.32-0.38), demonstrating it captures document-level difficulty rather than judge-specific noise. Across four judges and four criteria, both diagnostics converge: criterion matters more than judge, with relevance judged most reliably (avg. set size ≈ 3.0) and coherence moderately so (avg. set size ≈ 3.9), while fluency and consistency remain unreliable (avg. set size ≈ 4.9). We release all code, prompts, and cached results.
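To make the second diagnostic concrete, here is a minimal sketch of split conformal prediction sets over 1-5 Likert scores. It is not the paper's released code: the absolute-residual nonconformity score |human - judge|, the toy calibration data, and the names conformal_quantile and prediction_set are all assumptions chosen for illustration. The abstract does not state which nonconformity score the authors use.

```python
import math

def conformal_quantile(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank in the sorted scores
    if rank > n:
        # Too few calibration points for this alpha: only the full
        # label range can carry the coverage guarantee.
        return math.inf
    return sorted(cal_scores)[rank - 1]

def prediction_set(judge_score, qhat, labels=(1, 2, 3, 4, 5)):
    """All Likert labels whose nonconformity to the judge score is <= qhat."""
    return [k for k in labels if abs(k - judge_score) <= qhat]

# Hypothetical calibration split: judge scores paired with human scores.
cal_judge = [4, 3, 5, 2, 4, 3, 5, 4, 2, 3]
cal_human = [4, 4, 5, 3, 3, 3, 4, 4, 2, 1]

# Assumed nonconformity score: absolute residual between human and judge.
cal_scores = [abs(h - j) for j, h in zip(cal_judge, cal_human)]

qhat = conformal_quantile(cal_scores, alpha=0.2)
print(qhat)                     # 1
print(prediction_set(4, qhat))  # [3, 4, 5]
print(prediction_set(1, qhat))  # [1, 2]
```

Under the usual exchangeability assumption between calibration and test items, sets built this way contain the human score with probability at least 1-α. With this particular score every test item receives the same threshold, so set size only varies where the 1-5 scale clips the interval; a per-instance width signal of the kind the abstract reports would require a nonconformity score that adapts to the input, for example one based on the judge's distribution over scores.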