Text Adversarial Attacks with Dynamic Outputs

Abstract

Text adversarial attack methods are typically designed for static scenarios with fixed numbers of output labels and a predefined label space, relying on extensive querying of the victim model (query-based attacks) or the surrogate model (transfer-based attacks). To address this gap, we introduce the Textual Dynamic Outputs Attack (TDOA) method, which employs a clustering-based surrogate model training approach to convert the dynamic-output scenario into a static single-output scenario. To improve attack effectiveness, we propose the farthest-label targeted attack strategy, which selects adversarial vectors that deviate most from the model's coarse-grained labels, thereby maximizing disruption. We extensively evaluate TDOA on four datasets and eight victim models (e.g., ChatGPT-4o, ChatGPT-4.1), showing its effectiveness in crafting adversarial examples and its strong potential to compromise large language models with limited access. With a single query per text, TDOA achieves a maximum attack success rate of 50.81\%. Additionally, we find that TDOA also achieves state-of-the-art performance in conventional static output scenarios, reaching a maximum ASR of 82.68\%. Meanwhile, by conceptualizing translation tasks as classification problems with unbounded output spaces, we extend the TDOA framework to generative settings, surpassing prior results by up to 0.64 RDBLEU and 0.62 RDchrF.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…