CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates
Abstract
Vision Language Models (VLMs) have shown promising planning capabilities, yet their success remains confined to the text domain, leaving visual decision-making relatively underexplored. To address this gap, we introduce the Corrective Sequential Planning (CoSPlan) benchmark, in which VLMs must plan a sequence of visual actions that leads from an initial scene to a target scene. CoSPlan evaluates models on their ability to imagine and execute a coherent set of visual steps required to reach the goal (Step Completion). To prevent shortcuts that simply describe the final scene, we insert an erroneous action into the decision-making sequence, which the model must detect (Error Detection) and correct to reach the goal, requiring a deeper understanding of the task. CoSPlan spans four tasks: maze navigation, block re-arrangement, image reconstruction, and object re-organization. Despite using advanced reasoning strategies such as Chain-of-Thought and Scene Graphs, VLMs struggle on CoSPlan, while still showing promising performance in the text domain. To close this gap, we propose Scene Graph Incremental updates (SGI), a novel training-free method that transforms images into "textual" scene graphs, enabling step-by-step reasoning through iterative scene graph refinement. SGI yields an average improvement of ~4.4% on CoSPlan, with generalization to PlanBench and VQA. A link for solving the puzzles is available on the project page.
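To make the idea of incremental scene graph updates concrete, the following is a minimal sketch for a toy block re-arrangement setting. It is not the paper's implementation: the dictionary representation of the scene graph, the helper names (`is_clear`, `apply_action`, `first_infeasible_step`), and the use of a feasibility check as a stand-in for Error Detection are all assumptions made for illustration. In SGI the scene graph would be produced from images by a VLM; here it is written by hand.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical textual scene graph: each block maps to what it rests on
# ("table" or another block). This encoding is an assumption of the sketch.
SceneGraph = Dict[str, str]
Action = Tuple[str, str]  # (block to move, destination)


def is_clear(graph: SceneGraph, name: str) -> bool:
    """True if nothing rests on `name`; the table always has room."""
    return name == "table" or name not in graph.values()


def apply_action(graph: SceneGraph, action: Action) -> Optional[SceneGraph]:
    """Return an updated copy of the graph, or None if the move is infeasible."""
    block, dest = action
    if block not in graph or block == dest:
        return None
    if dest != "table" and dest not in graph:
        return None
    if not is_clear(graph, block) or not is_clear(graph, dest):
        return None
    updated = dict(graph)   # incremental update: only one edge changes
    updated[block] = dest
    return updated


def first_infeasible_step(graph: SceneGraph, plan: List[Action]) -> Optional[int]:
    """Index of the first action that cannot be applied, or None if all apply.

    Used here as a simplified proxy for error detection (an assumption).
    """
    for index, action in enumerate(plan):
        updated = apply_action(graph, action)
        if updated is None:
            return index
        graph = updated
    return None


def rollout(graph: SceneGraph, plan: List[Action]) -> SceneGraph:
    """Apply every action in order and return the final scene graph."""
    for action in plan:
        updated = apply_action(graph, action)
        if updated is None:
            raise ValueError(f"infeasible action: {action}")
        graph = updated
    return graph


initial = {"A": "table", "B": "A", "C": "table"}
target = {"A": "B", "B": "C", "C": "table"}

good_plan = [("B", "C"), ("A", "B")]
faulty_plan = [("B", "C"), ("C", "A"), ("A", "B")]  # step 1 moves a covered block

reached_target = rollout(initial, good_plan) == target
faulty_index = first_infeasible_step(initial, faulty_plan)
```

Updating the graph one action at a time keeps each reasoning step local: only the moved block's edge changes, so an inconsistent step can be pinpointed by its index instead of being inferred from the final scene alone.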