Bigger Isn't Always Better: A Comparative Evaluation of LLMs for Automated Code Review
Abstract
We present a systematic evaluation of five large language models on automated code review, comparing Claude Sonnet 4.6, Claude Haiku 4.5, GPT-5.4 mini, Minimax M2.7, and GLM-5 Turbo across 150 code review samples: 100 synthetic mutation-injected bugs and 50 real bug-fix pull requests mined from eight major open-source repositories. Our principal finding is that Claude Haiku 4.5, a smaller and cheaper model, consistently outperforms the larger Claude Sonnet 4.6, achieving higher F1 (0.365 vs. 0.343), 18% higher recall, and superior qualitative review scores across all four evaluation dimensions, at 3.2x lower cost per review. This result holds across three independent experimental conditions (n=25, n=100, n=150) and is independently confirmed on the Martian Code Review Benchmark, a third-party evaluation with different repositories, golden comments, and judge. We further report three secondary findings: (1) synthetic-only evaluation dramatically overestimates model capability: on real PRs alone, the best model achieves F1 = 0.066, compared to F1 = 0.847 on synthetic samples, a 92% degradation; (2) diff size is the dominant predictor of review quality, with F1 dropping from 0.657 on diffs under 10 lines to 0.043 on diffs over 150 lines; and (3) all models exhibit near-zero recall on performance-related bugs. We release our evaluation framework and dataset for reproducibility.
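As a rough aid to reading the figures above, the following sketch shows the standard F1 definition (the harmonic mean of precision and recall) and how a relative degradation such as the reported 92% follows from the two F1 values. The function names `f1_score` and `relative_drop` are hypothetical and chosen for illustration. The abstract does not state how predicted review comments are matched to ground truth, so this is not the authors' evaluation code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Standard F1: harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_drop(baseline: float, observed: float) -> float:
    """Fractional decrease from `baseline` to `observed`."""
    return (baseline - observed) / baseline


# F1 values quoted in the abstract.
synthetic_vs_real = relative_drop(0.847, 0.066)     # synthetic samples vs. real PRs
small_vs_large_diff = relative_drop(0.657, 0.043)   # diffs <10 lines vs. >150 lines

print(f"synthetic -> real PRs: {synthetic_vs_real:.0%}")   # prints 92%
print(f"small -> large diffs:  {small_vs_large_diff:.0%}")
```

The first printed value reproduces the 92% degradation stated in the abstract. The second applies the same arithmetic to the diff-size result, for which the abstract reports only the two endpoint F1 values.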