Learning Accurate Segmentation Purely from Self-Supervision
Abstract
Accurately segmenting objects without any manual annotations remains one of the core challenges in computer vision. In this work, we introduce Selfment, a fully self-supervised framework that segments foreground objects directly from raw images without human labels, pretrained segmentation models, or any post-processing. Selfment first constructs patch-level affinity graphs from self-supervised features and applies NCut to obtain an initial coarse foreground--background separation. We then introduce Iterative Patch Optimization (IPO), a feature-space refinement procedure that progressively enforces spatial coherence and semantic consistency through iterative patch clustering. The refined masks are subsequently used as supervisory signals to train a lightweight segmentation head with contrastive and region-consistency objectives, allowing the model to learn stable and transferable object representations. Despite its simplicity and complete absence of manual supervision, Selfment sets new state-of-the-art (SoTA) results across multiple benchmarks. It achieves substantial improvements on F over previous unsupervised saliency detection methods on ECSSD (+4.0\%), HKUIS (+4.6\%), and PASCAL-S (+5.7\%). Moreover, without any additional fine-tuning, Selfment demonstrates remarkable zero-shot generalization to camouflaged object detection tasks (e.g., 0.910 Sm on CHAMELEON and 0.792 Fβω on CAMO), outperforming all existing unsupervised approaches and even rivaling the SoTA fully supervised methods.
Turn this paper into a lesson
ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.