SAFE-DiT: Semantics-Aware Fast-path Execution for High-Resolution Diffusion Transformers

Abstract

High-resolution Diffusion Transformer (DiT) inference contains substantial spatial redundancy, but many spatially adaptive implementations encode regional computation as attention masks, which can inadvertently move scaled dot-product attention (SDPA) away from FlashAttention fast paths. We identify this avoidable systems bottleneck as Mask-Induced Dispatch Tax (MIDT) and show that it grows with latent sequence length. We introduce SAFE-DiT, a training-free Semantics-Aware Fast-path Execution framework that separates exact mask elision from approximation-based spatial scheduling. SAFE-DiT removes only provenance-certified image self-attention masks that induce a row-wise constant shift in attention logits, preserves semantics-bearing masks such as text-padding masks, and realizes spatial adaptation through prompt-conditioned token partitioning, selective state updates with global context, and periodic context refresh. We call this acceleration-only configuration SAFE-Core and report sensitivity-weighted classifier-free guidance separately as SAFE-DiT+SW. On the evaluated PyTorch SDPA stack, redundant masks make long-sequence attention 4.1× to 5.8× slower than the mask-free path. On Lumina-Next, SAFE-DiT achieves 2.69× end-to-end acceleration at 1024² resolution and 5.09× at 2560², reduces peak memory at 2560² from 94.1 to 27.9 GB, and enables 3072² generation when dense inference runs out of memory. Paired metrics, component ablations, and a blinded human study support visual non-inferiority of SAFE-Core to the dense fast-path baseline, while SAFE-DiT+SW provides a separate prompt-alignment operating point without reintroducing spatial self-attention masks. Code is available at https://github.com/xuanhuayin/SAFE-DiT.
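
The exactness of removing a mask that adds a row-wise constant to the attention logits follows from the shift invariance of softmax: for any finite constant c, softmax(z + c) = softmax(z). The sketch below illustrates this property in plain Python on toy logits and contrasts it with a padding-style mask, which changes the output and therefore has to be kept. It is only an illustration under stated assumptions, not the SAFE-DiT implementation: the helper names (`attention_weights`, `is_row_constant`) are hypothetical, and the numerical check used here stands in for the provenance-based certification mentioned in the abstract, whose details are not given there.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of attention logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(logits, mask=None):
    """Add an optional additive mask to the logits, then softmax each row."""
    if mask is None:
        return [softmax(row) for row in logits]
    return [
        softmax([l + b for l, b in zip(row, mask_row)])
        for row, mask_row in zip(logits, mask)
    ]

def is_row_constant(mask, tol=0.0):
    """Hypothetical check: every mask row holds one finite value (within tol)."""
    return all(
        all(math.isfinite(b) for b in row) and max(row) - min(row) <= tol
        for row in mask
    )

# Toy 2x3 logits (two queries, three keys); values are illustrative only.
logits = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]

# Each row is shifted by one constant, so softmax is unchanged.
constant_mask = [[-3.0, -3.0, -3.0], [7.0, 7.0, 7.0]]

# Padding-style mask: the last key is excluded, so the output changes.
padding_mask = [[0.0, 0.0, -math.inf], [0.0, 0.0, -math.inf]]

dense = attention_weights(logits)
shifted = attention_weights(logits, constant_mask)
padded = attention_weights(logits, padding_mask)

print("row-constant mask elidable:", is_row_constant(constant_mask))
print("padding mask elidable:", is_row_constant(padding_mask))
```

In this toy setting `shifted` matches `dense` up to floating-point rounding, while `padded` assigns zero weight to the last key and renormalizes the rest, which is why a mask of the second kind carries semantics and cannot be dropped.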
