Unified Reward Model for Multimodal Understanding and Generation
Credit: https://arxiv.org/pdf/2503.05236

Today’s paper introduces UnifiedReward, the first unified reward model capable of evaluating both understanding and generation tasks across image and video modalities. The authors address a key limitation of existing reward models, which are typically task-specific, by proposing a single model that handles diverse visual tasks through both pairwise ranking and pointwise scoring.

Figure 1. Overview of our UnifiedReward for Multimodal Understanding and Generation Alignment, including three steps: (1) Unified Reward Model Training, (2) Preference Data Construction, and (3) Generation/Understanding Model Alignment.

Method Overview

The UnifiedReward approach consists of three key stages:

  1. Unified Reward Model Training: The authors construct a large-scale human preference dataset spanning both image and video generation/understanding tasks. This dataset is used to train the unified reward model capable of both pairwise ranking and pointwise scoring.
  2. Preference Data Construction: The trained UnifiedReward model is employed to automatically generate high-quality preference pair data from the outputs of vision-language models (VLMs) and diffusion models. This involves a multi-stage filtering process, including pairwise ranking and pointwise sifting (sketched in code after this list).
  3. Generation/Understanding Model Alignment: The constructed preference pairs are used to align the outputs of baseline models with human preferences through Direct Preference Optimization (DPO).
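
To make stage 2 concrete, here is a minimal sketch of the preference-pair construction loop, assuming a generic reward-model interface with a pairwise `prefers(prompt, a, b)` comparison and a pointwise `score(prompt, x)` method. The interface names, the sample count, and the score-gap threshold are illustrative assumptions, not the paper's actual API or values.

```python
# Hypothetical sketch of stage 2: names, the sample count, and the score-gap
# threshold are illustrative assumptions, not the paper's actual API or values.

def build_preference_pairs(prompts, generate, reward_model,
                           num_samples=4, min_gap=1.0):
    """Construct (chosen, rejected) pairs for DPO from model outputs."""
    pairs = []
    for prompt in prompts:
        # (a) Data generation: sample several candidates from the VLM or diffusion model.
        candidates = [generate(prompt) for _ in range(num_samples)]

        # (b) Pairwise ranking: use the reward model's comparative judgement
        #     to select the best and worst candidates.
        best = worst = candidates[0]
        for c in candidates[1:]:
            if reward_model.prefers(prompt, c, best):    # c beats the current best
                best = c
            if reward_model.prefers(prompt, worst, c):   # current worst beats c
                worst = c

        # (c) Pointwise filtering ("point sifting"): keep the pair only when the
        #     absolute scores are far enough apart to be a reliable training signal.
        gap = reward_model.score(prompt, best) - reward_model.score(prompt, worst)
        if best is not worst and gap >= min_gap:
            pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})
    return pairs
```

The pairwise split into chosen and rejected outputs followed by pointwise refinement mirrors steps (b) and (c) in Figure 2; the resulting pairs feed stage 3 regardless of whether the generator is a VLM or a diffusion model.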

Figure 2. Method Overview. The pipeline of UnifiedReward consists of three key stages: (1) Unified Reward Model Training: We train a unified reward model to evaluate both multimodal generation and understanding tasks using pointwise scoring and pairwise ranking strategies. (2) Preference Data Construction: We use the trained reward model to construct high-quality preference data through three steps: (a) data generation from VLM/Diffusion, (b) pairwise ranking to divide the chosen and rejected outputs, and (c) pointwise filtering to refine the chosen and rejected samples. (3) Generation/Understanding Model Alignment: The constructed preference data is then used to fine-tune VLM/Diffusion via Direct Preference Optimization, aligning their outputs with human preferences.
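
For reference, stage 3 fine-tunes the base model on the constructed pairs with the standard DPO objective (the diffusion models may use a diffusion-specific variant), where $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ a frozen reference model, $(y_w, y_l)$ the chosen/rejected outputs for prompt $x$, and $\beta$ a temperature:

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\;
      \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

Minimizing this loss increases the policy's likelihood margin for the chosen output over the rejected one, relative to the reference model, without training a separate reward head at alignment time.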

The unified approach allows the model to leverage cross-task synergies, where enhanced image understanding improves image generation assessment, and refined image evaluation benefits video assessment through better frame analysis.

Results

The experimental results demonstrate that learning to assess multiple visual tasks jointly leads to significant mutual benefits across different domains. The authors show that their pipeline effectively improves the performance of both image and video understanding/generation models. The unified approach proves more adaptable and generalizable across various visual applications compared to task-specific reward models.

Conclusion

UnifiedReward represents a significant advancement in reward modeling for multimodal tasks by providing a single model capable of assessing both understanding and generation across image and video modalities. The work demonstrates that joint learning of diverse visual tasks creates synergistic improvements, expanding the scope and effectiveness of reward models for preference alignment. This approach reduces the need for extensive human annotations while improving the quality and alignment of vision models.

For more details: UnifiedReward Paper

Wang, Yibin, et al. “Unified Reward Model for Multimodal Understanding and Generation.” arXiv preprint arXiv:2503.05236 (2025).
