Agreement Between Large Language Models and Human Raters in Essay Scoring: A Research Synthesis

Abstract

Despite the growing promise of large language models (LLMs) in automated essay scoring (AES), empirical findings regarding their reliability compared to human raters remain mixed. Following the PRISMA 2020 guidelines, we synthesized 65 published and unpublished studies from January 2022 to August 2025 that examined agreement between LLM-generated scores and human ratings. Agreement levels varied substantially both across and within studies, with reported values spanning a wide range. Overall, the findings suggest that LLM-human agreement is highly context-dependent. Implications, challenges, and directions for future research are discussed.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…