FlowVLA: Visual Chain of Thought-based Motion Reasoning for Vision-Language-Action Models

Abstract

Many Vision-Language-Action (VLA) models are built upon an internal world model trained via next-frame prediction, "v_t → v_{t+1}". However, this paradigm attempts to predict the future frame's appearance directly, without explicitly reasoning about the underlying dynamics. The lack of an explicit motion reasoning step often leads to physically implausible visual forecasts and inefficient policy learning. To address this limitation, we introduce the Visual Chain of Thought (Visual CoT), a paradigm that compels the model to first reason about motion dynamics before generating the future frame. We instantiate this paradigm by proposing FlowVLA, an autoregressive Transformer that explicitly materializes this reasoning process as "v_t → f_t → v_{t+1}", where f_t is an intermediate optical flow prediction that inherently encodes motion. By forcing the model to first follow the motion plan encoded by f_t, this process aligns the pre-training objective of dynamics prediction with the downstream task of action generation. We conduct experiments on challenging robotic manipulation benchmarks, as well as real-robot evaluations. FlowVLA not only generates more coherent and physically plausible visual predictions, but also achieves state-of-the-art policy performance with substantially improved sample efficiency, pointing toward a more principled foundation for world modeling in VLAs. Project page: https://irpn-lab.github.io/FlowVLA/
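
To make the difference between the two prediction orders concrete, the following is a minimal sketch of how a training sequence might be laid out for an autoregressive model under each paradigm. It is an illustration only: the function names, the representation of frames and flows as lists of integer token IDs, and the example values are all assumptions, since the abstract does not specify how FlowVLA tokenizes frames or optical flow.

```python
from typing import List, Sequence


def next_frame_sequence(frames: Sequence[Sequence[int]]) -> List[int]:
    """Baseline ordering v_0, v_1, ..., v_T: each frame directly follows the last."""
    return [tok for frame in frames for tok in frame]


def visual_cot_sequence(
    frames: Sequence[Sequence[int]],
    flows: Sequence[Sequence[int]],
) -> List[int]:
    """Interleaved ordering v_0, f_0, v_1, f_1, ..., v_T.

    flows[t] is assumed to hold the (hypothetical) tokens of the optical flow
    from frames[t] to frames[t + 1], so there is one fewer flow than frames.
    """
    if len(flows) != len(frames) - 1:
        raise ValueError("expected exactly len(frames) - 1 flow entries")
    seq: List[int] = []
    for t, frame in enumerate(frames):
        seq.extend(frame)
        if t < len(flows):
            # The motion tokens come before the next frame, so an autoregressive
            # model must predict f_t before it predicts v_{t+1}.
            seq.extend(flows[t])
    return seq


# Made-up token IDs: three frames of two tokens each, two flows of two tokens each.
frames = [[1, 2], [3, 4], [5, 6]]
flows = [[10, 11], [12, 13]]
baseline = next_frame_sequence(frames)
interleaved = visual_cot_sequence(frames, flows)
```

Under this layout, ordinary next-token training on the interleaved sequence would condition every future frame on a preceding motion prediction, which is the ordering the abstract describes as "v_t → f_t → v_{t+1}".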
