Memorization In Stable Diffusion Is Unexpectedly Driven by CLIP Embeddings
Abstract
Understanding how textual embeddings contribute to memorization in text-to-image diffusion models is crucial for both interpretability and safety. This paper investigates an unexpected behavior of CLIP embeddings in Stable Diffusion, revealing that the model disproportionately relies on specific embeddings. We categorize input tokens as <startoftext>, <prompt>, <endoftext>, and <pad>, with corresponding embeddings v_sot, v_pr, v_eot, and v_pad. We discover that v_pr contributes minimally to generation in memorized cases. In contrast, v_pad strongly affects memorization because it structurally duplicates v_eot, the only embedding explicitly optimized during CLIP training. This duplication unintentionally amplifies the influence of v_eot, causing the model to over-rely on it and thereby driving memorization. Based on these observations, we propose two simple yet effective inference-time mitigation strategies: (1) replacing the tokenizer's default <pad> token, <endoftext>, with the ! token before embedding, and masking v_eot; (2) partially masking v_pad. Both suppress memorization without degrading quality and are readily deployable without prior detection.
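The token layout and the two mitigation strategies described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. Several details are assumptions made for illustration: the token ids (49406 for <startoftext>, 49407 for <endoftext>, 0 for !) and the 77-token context length follow the commonly used CLIP tokenizer; "masking" is taken to mean zeroing an embedding row; and partial masking is taken to mean zeroing the trailing half of the padding rows. The abstract does not specify any of these. The helper names and the prompt ids are hypothetical.

```python
# Assumed CLIP tokenizer constants (not stated in the abstract).
SOT_ID, EOT_ID, BANG_ID = 49406, 49407, 0
MAX_LEN = 77


def build_ids(prompt_ids, pad_id=EOT_ID, max_len=MAX_LEN):
    """Lay out <startoftext>, prompt, <endoftext>, then padding up to max_len."""
    body = list(prompt_ids)[: max_len - 2]
    ids = [SOT_ID] + body + [EOT_ID]
    return ids + [pad_id] * (max_len - len(ids))


def roles_for(n_prompt, max_len=MAX_LEN):
    """Label every position with its token category."""
    n_prompt = min(n_prompt, max_len - 2)
    n_pad = max_len - n_prompt - 2
    return ["sot"] + ["prompt"] * n_prompt + ["eot"] + ["pad"] * n_pad


def mask_role(embeddings, roles, role, fraction=1.0):
    """Zero the trailing `fraction` of rows with the given role.

    Zeroing, and choosing the trailing rows, are illustrative assumptions.
    """
    idx = [i for i, r in enumerate(roles) if r == role]
    n_mask = int(len(idx) * fraction)
    masked = set(idx[len(idx) - n_mask:])
    return [
        [0.0] * len(row) if i in masked else list(row)
        for i, row in enumerate(embeddings)
    ]


prompt_ids = [320, 1125, 539]  # hypothetical ids for a three-token prompt
roles = roles_for(len(prompt_ids))

# Default layout: padding reuses <endoftext>, so its id fills the tail.
default_ids = build_ids(prompt_ids)

# Strategy (1), token side: pad with "!" instead of <endoftext>.
swapped_ids = build_ids(prompt_ids, pad_id=BANG_ID)

# Placeholder embeddings standing in for the text encoder's output.
toy_embeddings = [[1.0, 1.0] for _ in range(MAX_LEN)]

# Strategy (1), embedding side: mask the <endoftext> row.
strategy_1 = mask_role(toy_embeddings, roles, "eot")

# Strategy (2): mask part of the padding rows (half of them here).
strategy_2 = mask_role(toy_embeddings, roles, "pad", fraction=0.5)
```

With the default layout, a three-token prompt leaves 72 padding positions that all carry the <endoftext> id, which is the structural duplication the abstract refers to. In a real pipeline the swapped ids would be passed through the CLIP text encoder, and the masking would be applied to its output before conditioning the diffusion model.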