LightZeroNav: Zero-Shot Vision Language Navigation in Continuous Environments Based on Lightweight VLMs

Yaoming Zhou

Abstract

Although vision-language navigation (VLN) has progressed rapidly, zero-shot VLN in continuous environments (VLN-CE) remains highly challenging with lightweight vision-language models (VLMs), whose limited reasoning capacity makes long-horizon navigation unreliable. In this paper, we propose LightZeroNav to tackle three major bottlenecks of lightweight VLMs in zero-shot VLN-CE, i.e., information redundancy from multi-source inputs, inaccurate progress estimation caused by noisy textual memory, and task entanglement between action execution and stage transition. Using only RGB observations and a lightweight open-source Qwen3-VL-8B backbone, LightZeroNav achieves performance competitive with GPT-4o (~200B) without task-specific training, graph search, or waypoint predictors, demonstrating its effectiveness in zero-shot VLN-CE.
