Leveraging Multi-View Weak Supervision for Occlusion-Aware Multi-Human Parsing
Abstract
Multi-human parsing is the task of segmenting human body parts while associating each part to the person it belongs to, combining instance-level and part-level information for fine-grained human understanding. In this work, we demonstrate that, while state-of-the-art approaches achieved notable results on public datasets, they struggle considerably in segmenting people with overlapping bodies. From the intuition that overlapping people may appear separated from a different point of view, we propose a novel training framework exploiting multi-view information to improve multi-human parsing models under occlusions. Our method integrates such knowledge during the training process, introducing a novel approach based on weak supervision on human instances and a multi-view consistency loss. Given the lack of suitable datasets in the literature, we propose a semi-automatic annotation strategy to generate human instance segmentation masks from multi-view RGB+D data and 3D human skeletons. The experiments demonstrate that the approach can achieve up to a 4.20\% relative improvement on human parsing over the baseline model in occlusion scenarios.
Turn this paper into a full lesson
ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.