When Safe Concepts Become Unsafe: Multi-Concept Compositional Vulnerabilities in Text-to-Image Models

Abstract

Text-to-image (T2I) models are increasingly optimized to follow user instructions faithfully. However, we find that this capability introduces a safety vulnerability we call Multi-Concept Compositional Unsafety (MCCU). MCCU occurs when multiple individually safe concepts, combined in a single generation request, lead to harmful or sensitive visual outputs. Unlike prior jailbreak settings, MCCU does not rely on adversarial prompts, model access, or explicitly disallowed content. Instead, the risk emerges from how the model composes multiple safe visual concepts into a single scene. To systematically measure this threat, we build TwoHamsters, a large-scale evaluation framework consisting of 20k prompts, 51 curated concept pairs, and six risk categories. We evaluate 13 T2I models in a black-box setting. Our results show a clear conflict between instruction following and safety: models that follow prompts more faithfully tend to produce more MCCU failures. For example, FLUX.1 achieves a 99.35% Unsafe Alignment Rate while reaching only a 1.57% MCCU Defense Rate. We further evaluate three representative defenses (safety filtering, MCCU-specific detector fine-tuning, and concept erasure), all of which fail against unseen concept combinations. Our findings suggest that compositional reasoning in T2I models creates an attack surface that existing safety mechanisms do not capture. We anticipate that the release of TwoHamsters will catalyze community development of advanced generative defense mechanisms.
