World Models in Words: Auditing Physical State-Transition Commitments in Vision-Language Models

Emmanuelle Bourigault

Abstract

Vision-language models (VLMs) are increasingly used to answer questions about physical scenes, yet most evaluations reduce performance to a final answer. This hides whether the model perceived the right objects, represented the right physical state, predicted a plausible transition, or merely selected the right option for the wrong reasons. We introduce an evaluation framework for auditing the language-expressed physical commitments of VLMs. Instead of scoring only $(I, q) \to a$, we ask models to produce a typed trace $(I, q) \to (s_0, \Delta s, s_1, a)$: an initial state, a state transition, a resulting state, and an answer. A hybrid verifier then checks schema validity, state grounding, transition consistency, and answer-trace compatibility, yielding typed error labels such as object, relation, force, transition, temporal, unit/scale, and faithfulness errors. We release a controlled trace resource with schema- and recomputation-validated synthetic scenarios across physics families, minimally perturbed contrastive preference pairs, verifier code, audit guidelines, and model outputs. We evaluate VLMs on both controlled and external physical-reasoning examples. The framework reveals failures that answer-only evaluation misses: 35% of correct answers from mid-tier models are backed by physically invalid traces. Verifier-guided reranking recovers up to 7 percentage points of trace validity without sacrificing answer accuracy, and trace-level preference tuning reduces hidden inconsistency by 41% in relative terms. The contribution is not another final-answer physics benchmark, but a reusable protocol for measuring whether a VLM's stated physical world can be true at the same time as its answer.
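
To make the trace audit concrete, here is a minimal Python sketch of the kind of check the abstract describes. It is not the released verifier: the function name `audit_trace`, the dictionary layout, the `"schema"` label, and the toy 1-D constant-velocity scenario are all assumptions made for illustration. It covers only three of the four checks (schema validity, transition consistency by recomputation, and answer-trace compatibility) and omits state grounding, which would need the image.

```python
from math import isclose

# Hypothetical trace layout: the four slots of the typed trace (s0, delta, s1, answer).
REQUIRED_KEYS = ("s0", "delta", "s1", "answer")


def audit_trace(trace):
    """Return a list of error labels for a toy 1-D constant-velocity trace."""
    # Schema validity: every slot of the typed trace must be present.
    if any(key not in trace for key in REQUIRED_KEYS):
        return ["schema"]

    errors = []
    s0, delta, s1 = trace["s0"], trace["delta"], trace["s1"]

    # Transition consistency: recompute the final position from s0 and delta
    # (x1 = x0 + v * dt) and compare it with the position stated in s1.
    expected_x = s0["x"] + s0["v"] * delta["dt"]
    if not isclose(s1["x"], expected_x, abs_tol=1e-6):
        errors.append("transition")

    # Answer-trace compatibility: the answer must agree with the stated final state.
    if not isclose(trace["answer"], s1["x"], abs_tol=1e-6):
        errors.append("faithfulness")

    return errors


# A fully consistent trace.
valid = {"s0": {"x": 0.0, "v": 2.0}, "delta": {"dt": 3.0},
         "s1": {"x": 6.0}, "answer": 6.0}

# The answer is numerically right, but the stated final state is not.
hidden = {"s0": {"x": 0.0, "v": 2.0}, "delta": {"dt": 3.0},
          "s1": {"x": 5.0}, "answer": 6.0}

print(audit_trace(valid))
print(audit_trace(hidden))
```

The second trace is the case that answer-only scoring would miss: its answer matches the recomputed position, yet the stated resulting state contradicts both the transition and the answer, so the sketch flags it with two labels.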
