Toward Inherently Robust VLMs Against Visual Perception Attacks
Abstract
Autonomous vehicles rely on deep neural networks (DNNs) for traffic sign recognition, lane centering, and vehicle detection, yet these models are vulnerable to attacks that induce misclassification and threaten safety. Existing defenses (e.g., adversarial training) often fail to generalize and degrade clean accuracy. We introduce Vehicle Vision-Language Models (V2LMs), fine-tuned vision-language models specialized for autonomous vehicle perception, and show that they are inherently more robust to unseen attacks without adversarial training, maintaining substantially higher adversarial accuracy than conventional DNNs. We study two deployments: Solo (task-specific V2LMs) and Tandem (a single V2LM for all three tasks). Under attacks, DNNs drop 33-74%, whereas V2LMs decline by under 8% on average. Tandem achieves comparable robustness to Solo while being more memory-efficient. We also explore integrating V2LMs in parallel with existing perception stacks to enhance resilience. Our results suggest V2LMs are a promising path toward secure, robust AV perception.
Turn this paper into a full lesson
ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.