AsyncMDE: Real-Time Monocular Depth Estimation via Asynchronous Spatial Memory
Abstract
Foundation-model-based monocular depth estimation offers a viable alternative to active sensors for robot perception, yet its computational cost often prohibits deployment on edge platforms. Existing methods perform independent per-frame inference and so fail to exploit the substantial computational redundancy between adjacent viewpoints in continuous robot operation. This paper presents AsyncMDE, an asynchronous depth perception system consisting of a frozen foundation model (the slow path) and a lightweight fast path that amortizes the foundation model's computational cost over time. The foundation model periodically produces high-quality spatial features in the background, while the lightweight fast path runs asynchronously in the foreground: it fuses cached memory with current observations through complementary fusion, outputs depth estimates, and autoregressively updates the memory. This enables cross-frame feature reuse with bounded accuracy degradation. With 3.83M trainable fast-path parameters and a 97.5M-parameter frozen slow path, AsyncMDE's fast path operates at 237 FPS on an RTX 4090, recovering 77% of the accuracy gap to the foundation model. Across indoor static, dynamic, and synthetic extreme-motion benchmarks, AsyncMDE degrades predictably. On a TensorRT-optimized Jetson AGX Orin, fast-path inference reaches 161 FPS, supporting real-time edge deployment.
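The slow-path/fast-path split described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the paper's implementation: `slow_path`, `fast_path`, `AsyncDepthSketch`, `refresh_every`, and `alpha` are hypothetical names; features are plain lists of floats; the fusion is a simple convex blend standing in for the paper's complementary fusion; and the asynchrony is simulated in a single thread by refreshing the memory every few frames, with no slow-path latency modeled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vector = List[float]


def slow_path(frame: Vector) -> Vector:
    """Hypothetical stand-in for the frozen foundation model (accurate, costly)."""
    return [2.0 * x for x in frame]


def fast_path(frame: Vector, memory: Vector, alpha: float) -> Tuple[Vector, Vector]:
    """Hypothetical stand-in for the lightweight fast path.

    Blends cached memory with cheap features of the current frame and
    returns (depth, new_memory). The convex blend is an assumption, not
    the paper's fusion operator.
    """
    cheap = [1.8 * x for x in frame]  # coarse features from the current frame
    fused = [alpha * m + (1.0 - alpha) * c for m, c in zip(memory, cheap)]
    depth = list(fused)  # placeholder depth head: identity on fused features
    return depth, fused


@dataclass
class AsyncDepthSketch:
    refresh_every: int = 4   # assumed slow-path period, in frames
    alpha: float = 0.7       # assumed weight on cached memory
    memory: Optional[Vector] = None
    slow_calls: int = 0
    fast_calls: int = 0

    def step(self, t: int, frame: Vector) -> Vector:
        # Simulated background refresh: a real system would run the slow
        # path concurrently and swap the memory in when it finishes.
        if self.memory is None or t % self.refresh_every == 0:
            self.memory = slow_path(frame)
            self.slow_calls += 1
        # The fast path runs on every frame and writes its fused features
        # back as the new memory (autoregressive update).
        depth, self.memory = fast_path(frame, self.memory, self.alpha)
        self.fast_calls += 1
        return depth


if __name__ == "__main__":
    sketch = AsyncDepthSketch(refresh_every=4, alpha=0.7)
    for t in range(8):
        print(t, sketch.step(t, [1.0, 2.0]))
    print("slow calls:", sketch.slow_calls, "fast calls:", sketch.fast_calls)
```

In this toy setup the output drifts toward the cheap features between refreshes and snaps back when the memory is renewed, which is a loose analogue of the bounded degradation the abstract describes. The expensive model runs once per `refresh_every` frames while an estimate is still produced for every frame.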