Contrastive vision-language learning with paraphrasing and negation
Abstract
Contrastive vision-language models continue to be the dominant approach for image-text retrieval. Contrastive Language-Image Pre-training (CLIP) trains two neural networks to align their image and text embeddings in a shared latent space. Negated and paraphrased text is a challenging case study for neurosymbolic AI: recent evaluations of CLIP on such text have shown mixed performance, because negation and paraphrasing are difficult to define formally for text data. Negation produces the opposite meaning through a variety of small lexical changes. Paraphrasing may use very different textual expressions to denote essentially the same thing. As a result, learning paraphrasing and negation together poses a significant challenge: the size of a change in syntax does not match the size of the change in intended meaning, which distances in embedding space are expected to capture. This paper proposes a new CLIP contrastive loss function that balances the requirements of paraphrasing and negation. It applies training triplets, each consisting of original, paraphrased, and negated text generated by multiple large language models, to the evaluation of CLIP models. The approach, called SemCLIP, aims to learn semantically relevant and simple embeddings, placing paraphrased captions nearer to the original image embeddings while pushing negated captions farther away. Empirically, SemCLIP preserves roughly the same performance as CLIP augmented with either negation or paraphrasing. Direct comparisons are difficult to make because learning with both negation and paraphrasing is a different problem, but an expected benefit of SemCLIP is robustness when applied zero-shot to downstream image classification tasks. Our experiments confirm this robustness, measured as the difference in accuracy (mean-accuracy delta) between original and negated captions on five downstream datasets.
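The pull-and-push idea described in the abstract can be illustrated with a small sketch. This is not the SemCLIP loss, whose exact form the abstract does not give; it is a generic hinge-style triplet objective written under stated assumptions. The function name `triplet_caption_loss` and the parameters `margin` and `weight` are hypothetical, and embeddings are plain Python lists standing in for encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_caption_loss(image, original, paraphrase, negation,
                         margin=0.2, weight=1.0):
    """Illustrative loss for one (original, paraphrase, negation) triplet.

    Assumption: this is a generic hinge formulation, not the loss
    proposed in the paper. `margin` and `weight` are hypothetical knobs.
    """
    s_orig = cosine(image, original)
    s_para = cosine(image, paraphrase)
    s_neg = cosine(image, negation)

    # Pull term: penalise a paraphrase that scores lower than the
    # original caption against the same image.
    pull = max(0.0, s_orig - s_para)

    # Push term: penalise a negated caption unless it scores at least
    # `margin` below the original caption.
    push = max(0.0, margin - (s_orig - s_neg))

    return pull + weight * push
```

The `weight` parameter is one simple way to trade off the two requirements: a larger value prioritises separating negated captions, a smaller one prioritises keeping paraphrases close. A real training setup would combine a term like this with the usual batch-wise CLIP contrastive objective.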