CLIP Brings Better Features to Visual Aesthetics Learners
Abstract
Image Aesthetics Assessment (IAA) is a challenging task due to its subjective nature and expensive manual annotations. Recent large-scale vision-language models, such as Contrastive Language-Image Pre-training (CLIP), have shown their promising representation capability for various downstream tasks. However, the application of CLIP to resource-constrained and low-data IAA tasks remains limited. While few attempts to leverage CLIP in IAA have mainly focused on carefully designed prompts, we extend beyond this by allowing models from different domains and with different model sizes to acquire knowledge from CLIP. To achieve this, we propose a unified and flexible two-phase CLIP-based Semi-supervised Knowledge Distillation (CSKD) paradigm, aiming to learn a lightweight IAA model while leveraging CLIP's strong generalization capability. Specifically, CSKD employs a feature alignment strategy to facilitate the distillation of heterogeneous CLIP teacher and IAA student models, effectively transferring valuable features from pre-trained visual representations to two lightweight IAA models, respectively. To efficiently adapt to downstream IAA tasks in a low-data regime, the two strong visual aesthetics learners then conduct distillation with unlabeled examples for refining and transferring the task-specific knowledge collaboratively. Extensive experiments demonstrate that the proposed CSKD achieves state-of-the-art performance on multiple widely used IAA benchmarks. Furthermore, analysis of attention distance and entropy before and after feature alignment shows the effective transfer of CLIP's feature representation to IAA models, which not only provides valuable guidance for the model initialization of IAA but also enhances the aesthetic feature representation of IAA models. Code will be made publicly available.
Turn this paper into a lesson
ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.