AugLift: Depth-Aware Input Reparameterization Improves Domain Generalization in 2D-to-3D Pose Lifting
Abstract
Lifting-based 3D human pose estimation infers 3D joints from 2D keypoints but generalizes poorly because (x,y) coordinates alone are an ill-posed, sparse representation that discards geometric information modern foundation models can recover. We propose AugLift, which changes the representation format of lifting from 2D coordinates to a 6D geometric descriptor via two modules: (1) an Uncertainty-Aware Depth Descriptor (UADD), a compact tuple (c, d, d, d) extracted from a confidence-scaled neighborhood of an off-the-shelf monocular depth map, and (2) a scale normalization component that handles train/test distance shifts. AugLift requires no new sensors, no new data collection, and no architectural changes beyond widening the input layer; because it operates at the representation level, it is composable with any lifting architecture or domain generalization technique. In the detection setting, AugLift reduces cross-dataset MPJPE by 10.1% on average across four datasets and four lifting architectures while improving in-distribution accuracy by 4.0%; post-hoc analysis shows that the gains concentrate on novel poses and occluded joints. In the ground-truth 2D setting, combining AugLift with PoseAug's differentiable domain generalization achieves state-of-the-art cross-dataset performance (62.4 mm on 3DHP and 92.6 mm on 3DPW, improvements of 14.5% and 22.2% over PoseAug, respectively), demonstrating that foundation-model depth provides genuine geometric signal complementary to explicit 3D augmentation. Code will be made publicly available.
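To make the idea of a confidence-scaled depth neighborhood concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not define the three depth entries of the tuple or the rule that sizes the neighborhood, so everything here is an assumption made for illustration: the function name `depth_descriptor`, the parameters `base_radius` and `max_extra`, the choice of center, minimum, and maximum depth as the three statistics, and the rule that lower keypoint confidence widens the sampling window. The scale normalization module is not sketched.

```python
import numpy as np


def depth_descriptor(depth_map, keypoints, base_radius=2, max_extra=6):
    """Build a hypothetical 6D per-joint input (x, y, c, d_center, d_min, d_max).

    depth_map: (H, W) array from any monocular depth estimator.
    keypoints: (J, 3) array of (x, y, confidence), confidence in [0, 1].
    The center/min/max statistics and the radius rule are illustrative
    assumptions, not the paper's definition of UADD.
    """
    h, w = depth_map.shape
    out = np.zeros((len(keypoints), 6), dtype=np.float32)
    for j, (x, y, c) in enumerate(keypoints):
        # Assumed rule: lower confidence -> larger sampling neighborhood.
        r = base_radius + int(round((1.0 - float(c)) * max_extra))
        # Clamp the keypoint to valid pixel indices.
        cx = int(np.clip(round(float(x)), 0, w - 1))
        cy = int(np.clip(round(float(y)), 0, h - 1))
        # Square window around the keypoint, truncated at the image border.
        patch = depth_map[max(cy - r, 0):cy + r + 1, max(cx - r, 0):cx + r + 1]
        out[j] = (x, y, c, depth_map[cy, cx], patch.min(), patch.max())
    return out
```

Under these assumptions, a lifting network that previously consumed a (J, 2) array of coordinates would instead consume the (J, 6) array returned here, which is the only change to the input layer that the abstract describes.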