EquiVLA: A General Framework for Rotationally Equivariant Vision-Language-Action Models

Abstract

Vision-Language-Action (VLA) models have emerged as a powerful paradigm for generalist robot manipulation, yet they lack geometric inductive biases: policies trained at specific orientations require substantially more data to generalize across rotational configurations. We present EquiVLA, the first general framework for end-to-end SO(2)-equivariant VLA models, applicable to any architecture that couples a frozen vision-language backbone with a flow-matching Diffusion Transformer action head. EquiVLA introduces two components: EquiPerceptor, which produces approximately SO(2)-equivariant visual representations from frozen ViT features, and EquiActor, an exactly SO(2)-equivariant flow-matching Diffusion Transformer action head. Together, they establish an approximate SO(2) equivariance chain from camera observations to predicted action sequences. Instantiated on GR00T N1.5 and evaluated on four LIBERO suites, CALVIN ABCD, and five real-robot tasks on Mobile ALOHA, EquiVLA achieves 92.6% average success on LIBERO (vs. 78.1% for the baseline), an average sequence length of 4.03 on CALVIN (vs. 3.45), and improves real-robot success from 54% to 72%.
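For readers new to the term, the SO(2) equivariance property named in the abstract can be illustrated with a toy check: a map f is equivariant if rotating its input and then applying f gives the same result as applying f and then rotating the output. The sketch below is only an illustration of that definition on 2D vectors. The functions `toy_policy` and `biased_policy` are hypothetical stand-ins invented here, not EquiPerceptor, EquiActor, or any part of the authors' implementation.

```python
import math


def rotate(v, theta):
    """Rotate a 2D vector v = (x, y) counter-clockwise by theta radians."""
    x, y = v
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)


def toy_policy(obs):
    """Hypothetical equivariant map: rescale obs by a function of its norm.

    The norm is unchanged by rotation, so this map commutes with rotations.
    """
    x, y = obs
    gain = 1.0 / (1.0 + math.hypot(x, y))
    return (gain * x, gain * y)


def biased_policy(obs):
    """Hypothetical non-equivariant map: adds a fixed offset along x."""
    x, y = obs
    return (x + 0.5, y)


def equivariance_error(policy, obs, theta):
    """Distance between policy(rotate(obs)) and rotate(policy(obs))."""
    a = policy(rotate(obs, theta))
    b = rotate(policy(obs), theta)
    return math.hypot(a[0] - b[0], a[1] - b[1])


if __name__ == "__main__":
    obs, theta = (0.3, -1.2), 0.7
    print("toy_policy error:   ", equivariance_error(toy_policy, obs, theta))
    print("biased_policy error:", equivariance_error(biased_policy, obs, theta))
```

An exactly equivariant map yields an error of zero up to floating-point rounding for every rotation angle, while an approximately equivariant one yields a small but nonzero error. That is the distinction the abstract draws between the two components.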
