Semantically Guided Dynamic Visual Prototype Refinement for Compositional Zero-Shot Learning
Abstract
Compositional Zero-Shot Learning (CZSL) seeks to recognize unseen state-object pairs by recombining primitives learned from seen compositions. Despite recent progress with vision-language models (VLMs), two limitations remain: (i) text-driven semantic prototypes are weakly discriminative in the visual feature space; and (ii) unseen pairs are optimized passively, thereby inducing seen bias. To address these limitations, we present Duplex, a framework that couples dual-prototype learning with dynamic local-graph refinement of visual prototypes. For each composition, Duplex maintains a semantic prototype via prompt learning and a visual prototype for unseen pairs constructed by recombining disentangled state and object primitives from seen images. The visual prototypes are updated dynamically through lightweight aggregation on mini-batch local graphs, which incorporates unseen compositions during training without labels. This design introduces fine-grained visual evidence while preserving semantic structure. It enriches class prototypes, better disambiguates semantically similar yet visually distinct pairs, and mitigates seen bias. Experiments on MIT-States, UT-Zappos, and CGQA in closed-world and open-world settings achieve competitive performance and consistent compositional generalization. Our source code is available at https://github.com/ISPZ/Duplex-CZSL.
Turn this paper into a full lesson
ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.