SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models

Abstract

Recently, remarkable progress has been made in Unified Multimodal Models (UMMs), which integrate vision-language generation and understanding capabilities within a single framework. However, a model's strong visual understanding often fails to transfer to visual generation: it may correctly judge prompt-image alignment while failing to generate a faithful image from the same prompt. This raises a compelling question: Can a model improve itself by using its understanding module to reward its generation module? We introduce SRUM, a self-rewarding post-training framework directly applicable to existing UMMs of various designs. SRUM creates a feedback loop where the model's own understanding module acts as an internal "evaluator", providing corrective signals to improve generation without additional human-labeled data or external reward models. To provide comprehensive feedback, SRUM uses a global-local dual reward system: a global reward ensures overall visual semantics and layout, while a local reward refines fine-grained, object-level fidelity. SRUM shows strong generalization, boosting performance on T2I-CompBench from 82.18 to 88.37 and on T2I-ReasonBench from 43.82 to 46.75. Overall, our work establishes a powerful paradigm for enabling a UMM's understanding module to guide and enhance its own generation via self-rewarding.
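
The global-local dual reward can be pictured with a small sketch. The abstract does not say how the two rewards are computed or combined, so everything here is an assumption made for illustration: the `Judgement` container, the scores in [0, 1], the linear blend controlled by `local_weight`, and the `pick_best` helper are all hypothetical and are not SRUM's actual formulation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Judgement:
    """Scores a hypothetical understanding module assigns to one generated image."""
    global_score: float            # overall semantics and layout vs. the prompt, in [0, 1]
    local_scores: Sequence[float]  # one score per object or region, each in [0, 1]


def combined_reward(j: Judgement, local_weight: float = 0.5) -> float:
    """Blend the global score with the mean local score.

    The linear blend is an assumption for illustration only.
    """
    if not 0.0 <= local_weight <= 1.0:
        raise ValueError("local_weight must be in [0, 1]")
    if j.local_scores:
        local = sum(j.local_scores) / len(j.local_scores)
    else:
        # No object-level scores available: fall back to the global score.
        local = j.global_score
    return (1.0 - local_weight) * j.global_score + local_weight * local


def pick_best(candidates: Sequence[Judgement], local_weight: float = 0.5) -> int:
    """Return the index of the candidate with the highest combined reward."""
    return max(
        range(len(candidates)),
        key=lambda i: combined_reward(candidates[i], local_weight),
    )
```

Under these assumptions, an image with a plausible overall layout but wrong objects can rank below one with a slightly weaker layout and faithful objects, which is the kind of fine-grained signal the local reward is described as providing.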
