TraversalBench: Challenging Paths to Follow for Vision Language Models

Abstract

Vision-language models (VLMs) perform strongly on multimodal benchmarks, but their ability to follow complex visual paths remains under-tested. We introduce TraversalBench, a controlled benchmark for exact visual path traversal. Each instance contains a continuous polyline with a unique start marker and labeled vertices; models must recover the ordered sequence encountered from start to finish. The benchmark balances self-intersection count, tortuosity, vertex count, and nearby confounding lines while limiting reliance on OCR, world knowledge, or open-ended planning. We find that self-intersections are the dominant source of difficulty. A first-crossing analysis localizes failures to crossing points: performance is stable before the first crossing, then drops sharply when the model must resolve the correct continuation. Nearby confounders have weaker but compounding effects, and an auxiliary reading-order benchmark reveals a consistent left-to-right bias. Together, these results characterize how VLMs perceive and fail on visual paths. Finally, we position TraversalBench as a new contribution to the growing line of sustained and precise visual grounding benchmarks for VLMs. Code, benchmark data, and rendered examples are available at https://github.com/clarapetrova/traversalbench.
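
To make the task concrete, here is a minimal sketch of two quantities the abstract refers to: the self-intersection count of a path and an exact-match check on the predicted label sequence. It is not taken from the TraversalBench code. It assumes the polyline is given as straight segments between consecutive vertices and that a prediction counts as correct only if it matches the reference order exactly. The function names `count_self_intersections` and `exact_match` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def _orient(a: Point, b: Point, c: Point) -> float:
    """Signed area term for triangle abc; positive means counter-clockwise."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def _properly_cross(p1: Point, p2: Point, q1: Point, q2: Point) -> bool:
    """True if segments p1p2 and q1q2 cross at a single interior point."""
    d1 = _orient(q1, q2, p1)
    d2 = _orient(q1, q2, p2)
    d3 = _orient(p1, p2, q1)
    d4 = _orient(p1, p2, q2)
    # Each segment's endpoints must lie strictly on opposite sides of the other.
    return d1 * d2 < 0 and d3 * d4 < 0


def count_self_intersections(vertices: List[Point]) -> int:
    """Count proper crossings between non-adjacent segments of a polyline.

    Assumption: the path is piecewise linear; touching and collinear
    overlaps are not counted.
    """
    n_segments = len(vertices) - 1
    count = 0
    for i in range(n_segments):
        # Start at i + 2 so segments sharing a vertex are skipped.
        for j in range(i + 2, n_segments):
            if _properly_cross(vertices[i], vertices[i + 1],
                               vertices[j], vertices[j + 1]):
                count += 1
    return count


def exact_match(predicted: Sequence[str], reference: Sequence[str]) -> bool:
    """Assumed scoring rule: the full ordered label sequence must match."""
    return list(predicted) == list(reference)
```

For example, the three-segment path `[(0, 0), (2, 2), (2, 0), (0, 2)]` has one crossing, where its first and last segments meet at `(1, 1)`. A model that swaps two labels after that crossing would fail `exact_match` even though every label is present.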
