Rectified Schrödinger Bridge Matching for Few-Step Visual Navigation
Abstract
Visual navigation is a core challenge in Embodied AI, requiring autonomous agents to translate high-dimensional sensory observations into continuous, long-horizon action trajectories. While generative policies based on diffusion models and Schrödinger Bridges (SB) effectively capture multimodal action distributions, they require dozens of integration steps due to high-variance stochastic transport, posing a critical barrier for real-time robotic control. We propose Rectified Schrödinger Bridge Matching (RSBM), a framework that exploits a shared velocity-field structure between standard Schrödinger Bridges (ε = 1, maximum-entropy transport) and deterministic Optimal Transport (ε → 0, as in Conditional Flow Matching), controlled by a single entropic regularization parameter ε. We prove two key results: (1) the conditional velocity field's functional form is invariant across the entire ε-spectrum (Velocity Structure Invariance), enabling a single network to serve all regularization strengths; and (2) reducing ε linearly decreases the conditional velocity variance, enabling more stable coarse-step ODE integration. Anchored to a learned conditional prior that shortens transport distance, RSBM operates at an intermediate ε that balances multimodal coverage and path straightness. Empirically, while standard bridges require ≥ 10 steps to converge, RSBM achieves over 94% cosine similarity and a 92% success rate in merely 3 integration steps, without distillation or multi-stage training, substantially narrowing the gap between high-fidelity generative policies and the low-latency demands of Embodied AI.
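The two claims about ε can be illustrated with a small numerical sketch. It is not the paper's implementation: it assumes the standard Brownian-bridge conditional path x_t = (1 - t) x0 + t x1 + sqrt(ε t (1 - t)) z with z ~ N(0, 1), and the commonly used conditional target velocity (x1 - x_t) / (1 - t). The function names and the scalar setting are hypothetical, chosen only for illustration. Under these assumptions the velocity formula contains no ε, and its variance at a fixed t equals ε t / (1 - t), which is linear in ε.

```python
import math
import random
import statistics


def bridge_sample(x0, x1, t, eps, rng):
    """Draw x_t from a Brownian bridge pinned at x0 (t=0) and x1 (t=1).

    Assumed path: mean interpolates linearly, variance is eps * t * (1 - t).
    """
    mean = (1.0 - t) * x0 + t * x1
    std = math.sqrt(eps * t * (1.0 - t))
    return mean + std * rng.gauss(0.0, 1.0)


def conditional_velocity(x_t, x1, t):
    """Assumed conditional target velocity; eps does not appear in the formula."""
    return (x1 - x_t) / (1.0 - t)


def velocity_variance(eps, t=0.5, x0=0.0, x1=1.0, n=5000, seed=0):
    """Empirical variance of the conditional velocity at time t for a given eps."""
    rng = random.Random(seed)  # fixed seed so runs with different eps share the same noise
    velocities = [
        conditional_velocity(bridge_sample(x0, x1, t, eps, rng), x1, t)
        for _ in range(n)
    ]
    return statistics.pvariance(velocities)


if __name__ == "__main__":
    for eps in (1.0, 0.5, 0.1, 0.0):
        print(f"eps={eps:.1f}  empirical velocity variance={velocity_variance(eps):.4f}")
```

At ε = 0 the sampled path is the straight line from x0 to x1 and the velocity is constant, which is the deterministic limit the abstract associates with Conditional Flow Matching. Lower variance in the regression target is one plausible reason a few coarse Euler steps can suffice.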