Efficient-VLN: A Simple yet Strong Baseline for Efficient Vision-Language Navigation
Abstract
While Multimodal Large Language Models (MLLMs) have demonstrated significant promise in Vision-Language Navigation (VLN), existing agents remain constrained by bottlenecks in inference, training, and data collection. Specifically, they suffer from prohibitive latency caused by reprocessing the visual history at every step, action leakage during sequence-packed training, and suboptimal exploration when collecting self-correction data. To overcome these intertwined challenges, we present Efficient-VLN, an efficient and robust baseline that resolves each issue with a simple yet effective mechanism. (1) Inference: We introduce KV-cache reuse with contiguous RoPE, enabling the model to process only the newly observed frame at each step for real-time inference. (2) Training: We propose packed training with an action-isolating mask, which accelerates throughput while bridging the training-inference gap by preventing action leakage. (3) Data Collection: We employ Adaptive DAgger to dynamically balance autonomous exploration and oracle guidance, enhancing error-recovery capability without escalating computational costs. Extensive evaluations show that Efficient-VLN significantly advances the state of the art on the R2R-CE (73.2% SR) and RxR-CE (75.6% SR) benchmarks. It also reduces latency by 28% compared to the previous state-of-the-art StreamVLN, establishing a new paradigm for streaming MLLM-based navigation.
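To make the idea of an action-isolating mask concrete, the following is a minimal sketch under stated assumptions, not the paper's implementation. The abstract does not specify the mask, so this assumes one plausible reading: within a packed trajectory, attention stays causal, but a token may attend to an action token only if both belong to the same navigation step, so that later steps cannot read earlier ground-truth actions. The function name `action_isolating_mask` and the inputs `step_ids` and `is_action` are hypothetical.

```python
from typing import List


def action_isolating_mask(step_ids: List[int], is_action: List[bool]) -> List[List[bool]]:
    """Return mask[i][j] == True when query token i may attend to key token j.

    Assumed rule (illustrative, not taken from the paper):
      * attention is causal (j <= i);
      * observation tokens are visible to every later token;
      * action tokens are visible only to tokens of the same step.
    """
    n = len(step_ids)
    if len(is_action) != n:
        raise ValueError("step_ids and is_action must have the same length")

    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: only current and earlier positions
            same_step = step_ids[j] == step_ids[i]
            mask[i][j] = (not is_action[j]) or same_step
    return mask


# Toy packed trajectory: two steps, each with two frame tokens and one action token.
step_ids = [0, 0, 0, 1, 1, 1]
is_action = [False, False, True, False, False, True]
mask = action_isolating_mask(step_ids, is_action)

# Print the mask with "1" for allowed attention and "." for blocked attention.
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

In this toy case, the tokens of step 1 can still attend to the frame tokens of step 0 but not to the step-0 action token, which mirrors an inference setting in which only observed frames are carried forward. A boolean mask of this shape could then be converted to whatever format the attention implementation in use expects.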