The tractability landscape of diffusion alignment: regularization, rewards, and computational primitives

Dhruv Rohatgi

Abstract

Inference-time reward alignment asks how to turn a pre-trained diffusion model with base law $p$ into a sampler that favors a reward $r$ while remaining close to $p$. Since there is no canonical distributional distance for this closeness constraint, different choices lead to different "reward-aligned" laws and, just as importantly, different algorithmic problems. We develop a primitive-based approach to reward alignment: rather than assuming arbitrary reward-aligned laws can be sampled, we ask which simple algorithmic primitives suffice to implement alignment for non-trivial reward classes. If closeness is measured in KL distance, the target law is $q(x) \propto p(x) \exp(\lambda^{-1} r(x))$. For this setting, we show that linear exponential tilts of the form $q(x) \propto p(x) \exp(\langle \theta, x \rangle)$, which according to recent work [MRR26] can be efficiently sampled from, are a sufficient primitive for aligning to a very broad class of convex low-dimensional rewards. If closeness is measured in Wasserstein distance, the corresponding primitive is a proximal transport oracle: given $x$, solve $\arg\max_y \{ r(y) - \lambda c(x,y) \}$. This oracle can be efficiently implemented for concave or low-dimensional Lipschitz rewards $r(x) = f(Ax)$. Together, these results illustrate that the choice of distribution distance for alignment affects the computational primitive and the tractable reward class.
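
To make the two primitives concrete, here is a minimal toy sketch in Python; it is not the paper's algorithm. It assumes a base law on a three-point support, so the KL-aligned law $q \propto p \exp(\lambda^{-1} r)$ can be computed by direct normalization. For the proximal transport oracle it assumes a quadratic cost $c(x,y) = \tfrac12 \|x-y\|^2$ and a concave quadratic reward $r(y) = -\tfrac12 \|y-t\|^2$, for which the maximizer has a closed form. The function names (`kl_tilt`, `kl_objective`, `prox_quadratic`), the cost, the reward, and the numbers are all illustrative assumptions and do not come from the paper.

```python
import math

def kl_tilt(p, r, lam):
    """Toy KL-aligned law q ∝ p * exp(r / lam) on a finite support (assumes p > 0)."""
    logits = [math.log(pi) + ri / lam for pi, ri in zip(p, r)]
    m = max(logits)                      # subtract the max for numerical stability
    w = [math.exp(l - m) for l in logits]
    z = sum(w)
    return [wi / z for wi in w]

def kl_objective(q, p, r, lam):
    """Regularized objective E_q[r] - lam * KL(q || p), which the tilt maximizes."""
    reward = sum(qi * ri for qi, ri in zip(q, r))
    kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)
    return reward - lam * kl

def prox_quadratic(x, target, lam):
    """Closed-form argmax_y of -0.5*||y - target||^2 - lam * 0.5*||x - y||^2.

    Assumed (hypothetical) reward and cost; setting the gradient to zero gives
    y = (target + lam * x) / (1 + lam).
    """
    return [(ti + lam * xi) / (1.0 + lam) for xi, ti in zip(x, target)]

# Illustrative inputs (assumptions, not data from the paper).
p = [0.5, 0.3, 0.2]      # base law
r = [0.0, 1.0, 2.0]      # reward on each support point
lam = 0.5                # regularization strength
q = kl_tilt(p, r, lam)   # mass shifts toward high-reward points

x = [1.0, -2.0]
t = [3.0, 0.0]
y = prox_quadratic(x, t, 2.0)   # lies on the segment between x and t
```

In this toy setting, a smaller `lam` pushes `q` further toward the highest-reward point, while in the proximal step a larger `lam` keeps `y` closer to `x`. Both behaviors match the role of $\lambda$ as the weight on closeness to the base sample or law.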
