VLOD-TTA: Test-Time Adaptation of Vision-Language Object Detectors
Abstract
Vision-language object detectors (VLODs) such as YOLO-World and Grounding DINO exhibit strong zero-shot generalization, but their performance degrades under distribution shift. Test-time adaptation (TTA) offers a practical way to adapt models online using only unlabeled target data. However, despite substantial progress in TTA for vision-language classification, TTA for VLODs remains largely unexplored. The only prior method relies on a mean-teacher framework that introduces significant latency and memory overhead. To address this gap, we introduce VLOD-TTA, a TTA method that leverages dense proposal overlap and image-conditioned prompts to adapt VLODs with low additional overhead. VLOD-TTA combines (i) an IoU-weighted entropy objective that emphasizes spatially coherent proposal clusters and mitigates confirmation bias from isolated boxes, and (ii) image-conditioned prompt selection that ranks prompts by image-level compatibility and aggregates the most informative prompt scores for detection. Our experiments across diverse distribution shifts, including artistic domains, adverse driving conditions, low-light imagery, and common corruptions, indicate that, with both YOLO-World and Grounding DINO, VLOD-TTA consistently outperforms standard TTA baselines and the prior state-of-the-art method. Code: https://github.com/imatif17/VLOD-TTA
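To make the idea of an IoU-weighted entropy objective concrete, the following is a minimal illustrative sketch, not the authors' formulation. It assumes that each proposal's weight is its mean IoU with the other proposals, so that boxes inside an overlapping cluster dominate the loss and isolated boxes contribute little. The function names (`box_iou`, `entropy`, `iou_weighted_entropy`), the weighting rule, and the fallback for the no-overlap case are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import math


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def entropy(probs):
    """Shannon entropy (in nats) of one class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def iou_weighted_entropy(boxes, probs):
    """Entropy averaged over proposals, weighted by mean IoU with the others.

    Assumption (not from the paper): weight_i = mean_j!=i IoU(box_i, box_j).
    """
    n = len(boxes)
    weights = []
    for i in range(n):
        others = [box_iou(boxes[i], boxes[j]) for j in range(n) if j != i]
        weights.append(sum(others) / len(others) if others else 0.0)
    total = sum(weights)
    if total == 0:
        # Hypothetical fallback: no overlap anywhere, use the plain mean.
        return sum(entropy(p) for p in probs) / n if n else 0.0
    return sum(w * entropy(p) for w, p in zip(weights, probs)) / total


# Two overlapping, confident proposals and one isolated, uncertain proposal.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
probs = [[0.9, 0.1], [0.9, 0.1], [0.5, 0.5]]
weighted = iou_weighted_entropy(boxes, probs)
plain = sum(entropy(p) for p in probs) / len(probs)
```

In this toy example the isolated box has zero overlap with the others, so it receives zero weight and the weighted value is lower than the plain mean entropy. In an actual TTA loop a quantity of this kind would be computed on differentiable detector outputs and minimized with respect to a small set of adapted parameters; which parameters are updated is not stated in the abstract.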