Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation

Abstract

Multimodal large language models (MLLMs) have achieved strong performance on perception-oriented tasks, yet their ability to perform mathematical spatial reasoning, defined as the capacity to parse and manipulate two- and three-dimensional relations, remains unclear. Humans easily solve textbook-style spatial reasoning problems with over 95% accuracy, but we find that most leading MLLMs fail to reach even 60% on the same tasks. This striking gap highlights spatial reasoning as a fundamental weakness of current models. To investigate this gap, we present MathSpatial, the first large-scale, systematic dataset dedicated to mathematical spatial reasoning in MLLMs. MathSpatial provides two complementary subsets: (i) MathSpatial-Bench, a rigorously curated evaluation set of 2,000 problems spanning 3 categories and 11 subtypes, designed to isolate spatial reasoning from perceptual noise; and (ii) MathSpatial-Corpus, a training set of 8,000 problems equipped with verified solutions and structured reasoning traces. All problems are sourced from authentic educational materials and undergo multi-stage quality control, including deduplication, geometric consistency checking, and cross-validated solution verification. Benchmarking 16 leading MLLMs on MathSpatial-Bench reveals that spatial reasoning remains a fundamental bottleneck: even GPT-5 lags behind human performance by over 35 percentage points, with particularly poor results on abstract deduction tasks. We further show that training on MathSpatial-Corpus yields consistent improvements across model families, demonstrating the dataset's practical value for advancing spatial reasoning capabilities. MathSpatial is publicly available at https://shuolucs.github.io/MathSpatial.
