Category-Adaptive Cross-Modal Semantic Refinement and Transfer for Open-Vocabulary Multi-Label Recognition

Guanbin Li

Abstract

Benefiting from the generalization capability of CLIP, recent vision-language pre-training (VLP) models have demonstrated the ability to capture a wide range of visual concepts in everyday images. However, because open-vocabulary settings include unseen categories, existing algorithms struggle to capture semantic correlations between categories, leading to suboptimal performance on open-vocabulary multi-label recognition (OV-MLR). Furthermore, the number of discriminative areas varies substantially across object categories, which is misaligned with the fixed-number patch matching used in current methods and introduces noisy visual cues that hinder the capture of target semantics. To address these challenges, we propose a novel category-adaptive cross-modal semantic refinement and transfer (C2SRT) framework that models semantic correlations both within each category and across different categories in a category-adaptive manner. The proposed framework consists of two complementary modules: an intra-category semantic refinement (ISR) module and an inter-category semantic transfer (IST) module. Specifically, the ISR module leverages the cross-modal knowledge of the VLP model to adaptively select a set of local discriminative regions that represent the semantics of the target category. The IST module adaptively discovers a set of correlated categories for a target category by constructing a category-adaptive correlation graph, and transfers semantic knowledge from the correlated seen categories to unseen ones. Experiments on OV-MLR benchmarks demonstrate that the proposed C2SRT framework improves over current methods.
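To make the contrast with fixed-number patch matching concrete, the following is a minimal illustrative sketch, not the C2SRT implementation. It assumes that patch and category text embeddings already live in a shared space (as a CLIP-style model would provide), and it substitutes simple cosine-similarity thresholds for the learned selection and graph construction that the abstract describes. The function names, the `margin` parameter, and the `threshold` parameter are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def select_patches_adaptive(patch_feats, text_feat, margin=0.1):
    """Keep every patch whose cosine similarity to the category text
    embedding is within `margin` of the best-matching patch.

    Assumption: a similarity margin stands in for a learned selector.
    The number of kept patches varies per category, unlike top-k.
    """
    sims = l2_normalize(patch_feats) @ l2_normalize(text_feat)
    keep = np.flatnonzero(sims >= sims.max() - margin)
    return keep, sims


def correlated_categories(text_feats, target, threshold=0.5):
    """Return indices of categories whose text embeddings have cosine
    similarity >= `threshold` with the target category.

    Assumption: a fixed threshold on text similarity stands in for the
    category-adaptive correlation graph; the neighbor count still
    differs from one target category to another.
    """
    unit = l2_normalize(text_feats)
    sims = unit @ unit[target]
    sims[target] = -np.inf  # exclude the target itself
    return np.flatnonzero(sims >= threshold)


if __name__ == "__main__":
    # Toy 2-D embeddings: four image patches and one category prompt.
    patches = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
    prompt = np.array([1.0, 0.0])
    kept, _ = select_patches_adaptive(patches, prompt, margin=0.1)
    print("kept patches:", kept.tolist())

    # Toy text embeddings for three categories.
    cats = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
    for c in range(len(cats)):
        print("category", c, "neighbors:", correlated_categories(cats, c).tolist())
```

In this toy setting the number of selected patches and the number of correlated categories both depend on the target category, which is the property the abstract attributes to the ISR and IST modules; how those modules actually score regions and build the graph is not specified in the abstract.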
