Testing the Validity of Embedding-Based Similarity and Clustering for Handwritten Physics Solutions

Abstract

Text embeddings are increasingly used in physics education research to organize, compare, and cluster large collections of written text. Their appeal is clear: once student responses have been mapped into a vector space, similarity comparisons and clustering become computationally inexpensive. In assessment contexts, however, the relevant question is not merely whether clusters can be produced, but whether the geometry of the embedding space preserves grading-relevant distinctions. We tested this premise using 992 handwritten student-problem solutions from a high-stakes engineering thermodynamics exam, transcribed into five textual representations and embedded using nine embedding mechanisms. We compared embedding similarity and embedding-based hierarchical clusters against human-assigned scores. Across models, representations, and clustering choices, embedding similarity showed a consistent but modest relationship to score similarity, and the resulting clusters were score-enriched but not score-equivalent. Experiments with a synthetic data set suggest that this may be because embeddings behave like novices when categorizing physics-problem solutions: their similarity geometry is strongly influenced by surface features rather than by conceptual, semantic structure. These findings suggest that state-of-the-art embeddings can support exploratory organization and human-in-the-loop review of physics solutions, but they do not provide an unsupervised basis for grading without external validation against the assessment construct of interest.
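
The comparison of embedding similarity against score similarity can be illustrated with a minimal sketch. The abstract does not state which similarity measure or statistic was used, so everything below is an assumption made for illustration: cosine similarity between embedding vectors, negative absolute score difference as a stand-in for score similarity, and a Pearson correlation over all pairs of solutions. The function names and the toy data are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def similarity_score_agreement(embeddings, scores):
    """Correlate pairwise embedding similarity with pairwise score similarity.

    Score similarity is taken to be the negative absolute score difference
    (an assumption of this sketch, not the paper's definition).
    """
    emb_sims, score_sims = [], []
    for i, j in combinations(range(len(scores)), 2):
        emb_sims.append(cosine_similarity(embeddings[i], embeddings[j]))
        score_sims.append(-abs(scores[i] - scores[j]))
    return pearson(emb_sims, score_sims)


# Hypothetical toy data: four 2-d "embeddings" and their human-assigned scores.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_scores = [10, 9, 2, 3]

agreement = similarity_score_agreement(toy_embeddings, toy_scores)
print(f"pairwise agreement (Pearson r): {agreement:.3f}")
```

A value near 1 would mean that solutions close together in embedding space also received similar scores; a consistent but modest relationship corresponds to a clearly positive value well below 1. Because pairwise entries share solutions and are not independent, a significance claim would need a permutation-based procedure such as a Mantel test rather than the usual p-value for a correlation.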
