EquiVLA: A General Framework for Rotationally Equivariant Vision-Language-Action Models
Abstract
Vision-Language-Action (VLA) models have emerged as a powerful paradigm for generalist robot manipulation, yet they lack geometric inductive biases: policies trained at specific orientations require substantially more data to generalize across rotational configurations. We present EquiVLA, the first general framework for end-to-end SO(2)-equivariant VLA models, applicable to any architecture coupling a frozen vision-language backbone with a flow-matching Diffusion Transformer action head. EquiVLA introduces EquiPerceptor, which produces approximately SO(2)-equivariant visual representations from frozen ViT features; and EquiActor, an exactly SO(2)-equivariant flow-matching Diffusion Transformer action head. Together, they establish an approximate SO(2) equivariance chain from camera observations to predicted action sequences. Instantiated on GR00T N1.5 and evaluated across four LIBERO suites, CALVIN ABCD, and five real-robot tasks on Mobile ALOHA, EquiVLA achieves 92.6% average success on LIBERO (vs. 78.1% baseline), an average sequence length of 4.03 on CALVIN (vs. 3.45), and improves real-robot success from 54% to 72%.
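To make the central property concrete: a map f is SO(2)-equivariant if rotating its input by an angle θ rotates its output by the same angle, i.e. f(R·p) = R·f(p). The sketch below is only a toy illustration of that definition in two dimensions, not the EquiPerceptor or EquiActor architecture; the function names (`equivariant_policy`, `biased_policy`, `equivariance_error`) and the sample point are hypothetical and chosen purely for the example.

```python
import math

def rotate(v, theta):
    """Rotate a 2D vector v = (x, y) counterclockwise by theta radians."""
    x, y = v
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)

def equivariant_policy(p):
    """Toy policy (hypothetical): step toward the origin, scaled only by |p|."""
    r = math.hypot(p[0], p[1])
    scale = -1.0 / (1.0 + r)
    return (scale * p[0], scale * p[1])

def biased_policy(p):
    """Toy policy (hypothetical) with a fixed x offset, which breaks the symmetry."""
    return (-p[0] + 0.5, -p[1])

def equivariance_error(policy, p, theta):
    """Return ||policy(R p) - R policy(p)||; zero for an SO(2)-equivariant map."""
    a = policy(rotate(p, theta))
    b = rotate(policy(p), theta)
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Arbitrary sample object position and a 40 degree rotation of the scene.
p, theta = (0.3, -0.7), math.radians(40.0)
err_equi = equivariance_error(equivariant_policy, p, theta)
err_biased = equivariance_error(biased_policy, p, theta)
print(f"equivariant policy error: {err_equi:.2e}")
print(f"biased policy error:      {err_biased:.2e}")
```

The first policy depends on its input only through a radial scale, so it commutes with rotations up to floating-point error. The second carries a direction-specific offset and does not, which is the kind of orientation dependence a non-equivariant policy would have to learn away from data.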