The tractability landscape of diffusion alignment: regularization, rewards, and computational primitives
Abstract
Inference-time reward alignment asks how to turn a pre-trained diffusion model with base law $p$ into a sampler that favors a reward $r$ while remaining close to $p$. Since there is no canonical distributional distance for this closeness constraint, different choices lead to different "reward-aligned" laws and, just as importantly, different algorithmic problems. We develop a primitive-based approach to reward alignment: rather than assuming arbitrary reward-aligned laws can be sampled, we ask which simple algorithmic primitives suffice to implement alignment for non-trivial reward classes. If closeness is measured in KL divergence, the target law is $q(x) \propto p(x) \exp(\lambda^{-1} r(x))$. For this setting, we show that linear exponential tilts of the form $q(x) \propto p(x) \exp(\langle \theta, x \rangle)$, which according to recent work [MRR26] can be sampled efficiently, are a sufficient primitive for aligning to a very broad class of convex low-dimensional rewards. If closeness is measured in Wasserstein distance, the corresponding primitive is a proximal transport oracle: given $x$, solve $\arg\max_y \{ r(y) - \lambda c(x,y) \}$. This oracle can be efficiently implemented for concave or low-dimensional Lipschitz rewards $r(x) = f(Ax)$. Together, these results illustrate that the choice of distributional distance for alignment determines both the computational primitive and the tractable reward class.
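To make the two primitives concrete, the following is a minimal toy sketch in Python with NumPy, not the paper's algorithms. It assumes a finite support for the KL case (so the tilt can be normalized exactly) and a quadratic cost $c(x,y) = \tfrac{1}{2}\|x - y\|^2$ with a differentiable concave reward for the proximal transport oracle. The function names `kl_tilt`, `kl_objective`, and `prox_transport`, the step size, and the example reward are all illustrative assumptions.

```python
import numpy as np

def kl_tilt(p, r, lam):
    """Return q proportional to p * exp(r / lam) on a finite support."""
    logits = np.log(p) + r / lam
    logits -= logits.max()  # stabilise the exponentials
    w = np.exp(logits)
    return w / w.sum()

def kl_objective(q, p, r, lam):
    """E_q[r] - lam * KL(q || p), the objective the tilt maximises."""
    return float(np.sum(q * r) - lam * np.sum(q * np.log(q / p)))

def prox_transport(x, grad_r, lam, step, n_steps=200):
    """Gradient ascent on y -> r(y) - (lam / 2) * ||x - y||^2.

    The quadratic cost is an assumption made for this sketch.
    """
    y = np.array(x, dtype=float)
    for _ in range(n_steps):
        y = y + step * (grad_r(y) - lam * (y - x))
    return y

# KL case: tilt a three-point base law towards higher reward.
p = np.array([0.5, 0.3, 0.2])
r = np.array([0.0, 1.0, 2.0])
q = kl_tilt(p, r, lam=1.0)

# Wasserstein case: concave reward r(y) = -||y - a||^2 / 2 (illustrative).
a = np.array([1.0, -2.0])
x = np.array([0.0, 0.0])
y = prox_transport(x, lambda y: -(y - a), lam=3.0, step=0.1)
```

For this particular quadratic reward the proximal problem has the closed-form solution $(a + \lambda x)/(1 + \lambda)$, which gives a simple check on the gradient-ascent output. In the KL case, the tilted law attains a value of $\mathbb{E}_q[r] - \lambda\,\mathrm{KL}(q \,\|\, p)$ at least as large as the base law $p$ does.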