Bridging the Intention-Expression Gap: Aligning Multi-Dimensional Preferences via Hierarchical Relevance Feedback in Text-to-Image Diffusion
Abstract
Users often possess a clear visual intent but struggle to articulate it precisely in language. This intention-expression gap makes aligning generated images with latent visual preferences a fundamental challenge in text-to-image diffusion models. Existing methods either require model training, sacrificing flexibility, or rely on textual feedback, imposing a heavy cognitive burden. Although recent training-free methods use click-based binary preference feedback to reduce user effort, they force Foundation Models (FMs) to infer preferences at the semantic level. When faced with multi-dimensional preferences, FMs suffer from inference overload and fail to identify exact preferred feature values under conflicting user signals. Consequently, a flexible framework for multi-dimensional feature alignment remains absent. To address this, we propose a Hierarchical Relevance Feedback-Driven (HRFD) framework. Recognizing that multiple features struggle to converge simultaneously, HRFD organizes them into a three-tier hierarchy and adapts relevance feedback to enforce coarse-to-fine convergence, minimizing cognitive load. To bypass FM inference overload, HRFD decouples the process into independent single-feature preference inference tasks. Furthermore, to overcome FMs' failure in identifying preferred values, HRFD employs statistical inference to quantify the distribution divergence of features between "liked" and "disliked" image sets, achieving robust and transparent preference measurement. Crucially, HRFD operates entirely within the external text space, remaining strictly training-free and model-agnostic. Extensive experiments demonstrate that HRFD effectively captures the user's true visual intent, significantly outperforming baseline approaches.
Turn this paper into a full lesson
ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.