Unified Reward Model for Multimodal Understanding and Generation
Today’s paper introduces UnifiedReward, the first unified reward model capable of evaluating both understanding and generation tasks across image and video modalities. The authors address the limitation that existing reward models are typically task-specific, and instead propose a single model that handles diverse visual tasks through both pairwise ranking and pointwise scoring.
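To make the two evaluation modes concrete, here is a minimal sketch of how a generative reward model might be prompted for pointwise scoring and pairwise ranking. The `reward_model` object, its `generate` method, and the prompt wording are hypothetical stand-ins, not the paper's actual interface.

```python
# Minimal sketch of the two evaluation modes (hypothetical interface,
# not the paper's actual API). `reward_model` stands in for a
# vision-language model fine-tuned to judge visual content.

def pointwise_score(reward_model, prompt: str, output) -> float:
    """Ask the reward model to rate a single image/video on a fixed scale."""
    judgment = reward_model.generate(
        text=f"Rate how well this output matches the prompt '{prompt}' "
             "on a scale of 1 to 5. Answer with a single number.",
        media=output,
    )
    return float(judgment.strip())

def pairwise_rank(reward_model, prompt: str, output_a, output_b) -> str:
    """Ask the reward model which of two candidate outputs is better."""
    judgment = reward_model.generate(
        text=f"Given the prompt '{prompt}', which output is better, "
             "A or B? Answer with a single letter.",
        media=[output_a, output_b],
    )
    return judgment.strip()  # "A" or "B"
```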
Method Overview
The UnifiedReward pipeline consists of three key stages:

1. Construct a large-scale human preference dataset spanning image and video understanding and generation tasks, and train UnifiedReward on it.
2. Use the trained reward model to build high-quality preference pairs from the outputs of vision models, combining pairwise ranking with pointwise score-based filtering.
3. Fine-tune the image and video understanding/generation models on these pairs via Direct Preference Optimization (DPO); a sketch of this objective follows the list.
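Below is a minimal sketch of the standard DPO loss used in that final stage, assuming the summed token log-probabilities have already been computed. The function and argument names are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss (illustrative names).

    Each argument is a tensor of summed token log-probabilities for the
    chosen/rejected responses under the policy being tuned and under a
    frozen reference copy of it. In this pipeline, the chosen/rejected
    pairs would come from the reward model's rankings (stage 2).
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    # Push the policy to widen the gap between chosen and rejected
    # responses; beta controls how far it may drift from the reference.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```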
The unified approach allows the model to leverage cross-task synergies, where enhanced image understanding improves image generation assessment, and refined image evaluation benefits video assessment through better frame analysis.
Results
The experimental results demonstrate that learning to assess multiple visual tasks jointly leads to significant mutual benefits across domains. The authors show that their pipeline effectively improves the performance of both image and video understanding and generation models, and that the unified approach is more adaptable and generalizable across visual applications than task-specific reward models.
Conclusion
UnifiedReward represents a significant advancement in reward modeling for multimodal tasks by providing a single model capable of assessing both understanding and generation across image and video modalities. The work demonstrates that joint learning of diverse visual tasks creates synergistic improvements, expanding the scope and effectiveness of reward models for preference alignment. This approach reduces the need for extensive human annotations while improving the quality and alignment of vision models.
For more details: UnifiedReward paper (https://arxiv.org/abs/2503.05236)
Wang, Yibin, et al. “Unified Reward Model for Multimodal Understanding and Generation.” arXiv preprint arXiv:2503.05236 (2025).