The Point, the Vision and the Text: Does Point Cloud Boost Spatial Reasoning of Large Language Models? A Bias-Controlled Study

Abstract

3D Large Language Models (LLMs) that leverage spatial information in point clouds for 3D spatial reasoning have attracted great attention. Despite some promising results, the advantages of point clouds over other modalities remain unclear. Moreover, existing 3D benchmarks are insufficient for fairly evaluating the ability of multimodal LLMs (MLLMs) to comprehend spatial concepts. To address these challenges, we introduce ScanReQA, a 3D spatial reasoning benchmark encompassing the text, vision, and point cloud modalities. We then evaluate text, 2D, and 3D LLMs on the benchmark to compare how effectively each modality supports the understanding of spatial concepts. Furthermore, we analyze the reasoning mechanisms of 3D LLMs that use point clouds. Our findings reveal that: 1) binary spatial reasoning remains challenging for current 3D LLMs, 2) MLLMs based on point cloud and visual modalities demonstrate stronger spatial reasoning capabilities than LLMs, and 3) 3D LLMs exhibit an attention sink phenomenon similar to that in 2D LLMs, which impairs spatial reasoning. We believe these conclusions can inform the further development of 3D LLMs and also offer insights for foundation models in other modalities. We release the datasets and code on the project page: https://github.com/EmbodiedCity/ScanReQA.code.
