An LLM-Native Psychometric Instrument Does Not Predict LLM Behavior: Evidence Across 25 Models
Abstract
Large language models (LLMs) produce stable self-reports on personality inventories, but these self-reports do not predict observed behavior. Whether this gap reflects a mismatch between LLMs and human trait constructs, or a deeper property of LLM self-report itself, has been unresolved. We constructed the first psychometric instrument whose constructs are derived bottom-up from LLM behavioral affordances via exploratory factor analysis (EFA). We administered 300 items (240 direct Likert + 60 scenario-based) spanning 12 candidate behavioral dimensions to 25 LLMs across 17 model families, each item administered 30 times. EFA yielded a 5-factor structure -- Responsiveness, Deference, Boldness, Guardedness, and Verbosity -- with excellent split-half replicability (all Tucker ϕ≥ .957) and internal consistency (all α≥ .930). To test predictive validity, we collected 2,500 open-ended behavioral samples rated by 151 human raters and a three-judge LLM ensemble. Human and judge ratings agreed (r = .51), but neither tracked self-report: self-report--human r = -.01, self-report--judge r = .13, with no factor-level self-report--human CI excluding zero. On Responsiveness, self-report correlated with LLM judges (r = .53) but not humans (r = .04), even though humans and judges agreed (r = .59) -- indicating self-report items and LLM judges share variance that human observers do not, a confound invisible to within-ensemble reliability checks. We release the instrument as a diagnostic probe for alignment-shaped self-description and a concrete risk factor for LLM-as-judge pipelines.
Turn this paper into a lesson
ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.