VG2GT: Voxel-Gaussian Splatting Visual Geometry Grounded Transformer
Abstract
Gaussian splatting has shown strong potential for 3D reconstruction and novel view synthesis. However, most existing methods require accurate camera parameters and per-scene optimization, while feed-forward methods that use pixel-aligned Gaussian primitives often suffer from artifacts and non-uniform primitives. In this paper, we propose VG2GT, a Voxel-Gaussian Splatting Visual Geometry-Grounded Transformer. VG2GT leverages a frozen pretrained visual foundation model (VFM), incorporates a multi-scale differentiable voxel module to enhance geometric understanding, and directly splits and regresses Gaussian primitive parameters from voxel features. During training, depth maps are supervised through stochastic solid volume rendering, which enables geometrically accurate Gaussian scene reconstruction while the VFM stays fully frozen. This design allows VG2GT to be plugged into any patch-feature-based VFM and substantially reduces the required training cost. VG2GT outperforms current state-of-the-art methods on the widely used DTU, Replica, TAT, and ScanNet datasets.