LightOcc: Lightweight Spatial Embedding for Efficient Vision-based 3D Occupancy Prediction
Abstract
Occupancy prediction has garnered increasing attention in recent years for its comprehensive, fine-grained environmental representation and strong generalization to open-set objects. Nevertheless, mainstream occupancy prediction methods employ cumbersome voxel features as the scene representation, incurring substantial overhead in both memory and computation. When comparing the occupancy distribution along each spatial dimension, we find that the information entropy of the height dimension is much lower than that of the other two dimensions, which constitute the Bird's Eye View (BEV) plane. This indicates that the height distribution of occupancy is easier to learn and predict. Accordingly, we propose a Lightweight Spatial Embedding that represents complete height information more compactly than voxel features, which significantly improves deployability. First, Single-Channel Occupancy is sampled from the multi-view depth distributions and then processed by a Spatial-to-Channel mechanism, which extracts Lightweight Spatial Embeddings of the different views using 2D convolution. These embeddings interact with each other through the Lightweight Cross-View Interaction module to obtain the Unified Embedding, which can directly supplement BEV features with height information. Furthermore, we extract an Edge-aware Spatial Embedding and apply Geometric Supervision to the Spatial Embeddings to enhance their capability to represent spatial information. We also propose BEV-CutMix, a feature-level data augmentation strategy, to increase the diversity of the driving scenes. We integrate these components into a pure 2D convolutional model, namely LightOcc. Extensive experiments show that LightOcc achieves state-of-the-art performance on multiple benchmarks while demonstrating significant efficiency advantages.
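The entropy observation that motivates the method can be illustrated with a small sketch. The abstract does not say how the per-dimension entropy is measured, so the code below assumes one plausible reading: the Shannon entropy of the marginal distribution of occupied voxels along each axis. The function names `axis_entropy` and `toy_scene`, and the synthetic scene itself (a flat ground plane plus two box-shaped obstacles), are hypothetical and are not taken from the paper or its benchmarks.

```python
import math
from collections import Counter


def axis_entropy(occupied, axis):
    """Shannon entropy (bits) of the marginal distribution of occupied
    voxels along one axis. `occupied` is a list of (x, y, z) indices."""
    counts = Counter(voxel[axis] for voxel in occupied)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def toy_scene(nx=16, ny=16):
    """Hypothetical scene (an assumption for illustration only):
    a flat ground plane at z=0 plus two box-shaped obstacles."""
    occupied = {(x, y, 0) for x in range(nx) for y in range(ny)}
    occupied |= {(x, y, z)
                 for x in range(2, 4) for y in range(5, 9) for z in range(1, 3)}
    occupied |= {(x, y, z)
                 for x in range(10, 12) for y in range(1, 3) for z in range(1, 4)}
    return sorted(occupied)


scene = toy_scene()
h_x, h_y, h_z = (axis_entropy(scene, axis) for axis in range(3))
print(f"H(x)={h_x:.2f}  H(y)={h_y:.2f}  H(z)={h_z:.2f} bits")
```

In this toy scene most occupied voxels sit on the ground, so the height marginal is concentrated in a single bin and its entropy is far below that of the two BEV axes. This only mirrors the qualitative claim in the abstract; it says nothing about the magnitudes measured on real driving data.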