Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM
Credit: https://arxiv.org/pdf/2503.14478

Today's paper introduces Creation-MMBench, a novel benchmark designed to evaluate the creative capabilities of Multimodal Large Language Models (MLLMs) in image-based tasks. While Large Language Models (LLMs) have been extensively evaluated for creativity, there has been a significant gap in assessing the creative intelligence of MLLMs, which this benchmark aims to address by providing a comprehensive evaluation framework across diverse real-world scenarios.

Method Overview

Creation-MMBench consists of 765 test cases spanning 51 fine-grained tasks, categorized into four major groups: Literary Writing, Common Functional Writing, Professional Functional Writing, and Creative Multimodal Understanding. For each task, an MLLM is provided with one or more images along with detailed context that specifies the assigned role, necessary background information, and clear task instructions. The model must then leverage the visual input to accomplish various creative tasks, such as composing artwork-inspired prose, developing structured lesson plans, or interpreting conceptual foundations of advertisements.
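To make the setup concrete, here is a minimal Python sketch of how a test case of this form might be represented. The class, field names, and the example task are illustrative assumptions, not the benchmark's actual schema.

```python
# Illustrative sketch of a Creation-MMBench-style test case.
# Field names and prompt wording are assumptions, not the paper's schema.
from dataclasses import dataclass
from typing import List

@dataclass
class CreationTestCase:
    task_group: str      # e.g. "Literary Writing", "Professional Functional Writing"
    task_name: str       # one of the 51 fine-grained tasks
    images: List[str]    # paths or URLs to one or more input images
    role: str            # role assigned to the model (e.g. "art critic")
    background: str      # necessary background information
    instruction: str     # the concrete creative task instruction

    def build_prompt(self) -> str:
        """Assemble the textual context that accompanies the images."""
        return (
            f"You are {self.role}.\n"
            f"Background: {self.background}\n"
            f"Task: {self.instruction}"
        )

# Hypothetical lesson-plan example
case = CreationTestCase(
    task_group="Professional Functional Writing",
    task_name="lesson_plan_design",
    images=["classroom_photo.jpg"],
    role="an experienced primary-school teacher",
    background="The class is studying local wildlife this semester.",
    instruction="Design a structured lesson plan inspired by the scene in the image.",
)
print(case.build_prompt())
```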

The evaluation framework implements the MLLM-as-a-Judge methodology, utilizing GPT-4o to assess the quality of model-generated responses. Recognizing that creative responses cannot be evaluated using simple rule-based methods, the benchmark defines instance-specific evaluation criteria for each test case. These criteria ensure that responses are assessed based on their ability to effectively integrate contextual and visual information.
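As a rough illustration of how instance-specific criteria could be folded into a judge prompt, the sketch below assembles a pairwise comparison request. The wording, criteria format, and output protocol are assumptions for illustration, not the paper's exact templates.

```python
# Sketch of building an MLLM-as-a-Judge prompt with instance-specific criteria.
# The prompt wording and verdict format are illustrative assumptions.
def build_judge_prompt(context: str, criteria: list[str],
                       response_a: str, response_b: str) -> str:
    criteria_block = "\n".join(f"- {c}" for c in criteria)
    return (
        "You are evaluating two creative responses to the same multimodal task.\n"
        f"Task context:\n{context}\n\n"
        f"Instance-specific evaluation criteria:\n{criteria_block}\n\n"
        f"[Response A]\n{response_a}\n\n"
        f"[Response B]\n{response_b}\n\n"
        "Judge which response better satisfies the criteria and integrates the "
        "visual and contextual information. Answer with 'A', 'B', or 'Tie', "
        "followed by a brief justification."
    )

# Example with a hypothetical artwork-inspired prose task
prompt = build_judge_prompt(
    context="Write artwork-inspired prose for the attached painting.",
    criteria=[
        "Accurately reflects the visual content of the painting",
        "Maintains the assigned narrative voice",
        "Demonstrates original, vivid literary expression",
    ],
    response_a="<model-generated response>",
    response_b="<reference answer>",
)
```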

The evaluation consists of two components: a pairwise comparison where an MLLM-generated response is compared against a reference answer, and a visual factuality score that evaluates whether the response aligns with key facts present in the visual input. This dual approach provides a comprehensive assessment of both creative quality and factual accuracy. Additionally, the paper introduces Creation-MMBench-TO, a text-only variant where image inputs are replaced with corresponding textual descriptions, to explore the impact of visual instruction tuning on creative abilities.
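The sketch below shows one plausible way of aggregating the two signals into summary scores. The verdict labels, the 0-10 factuality scale, and the function names are assumptions made here for illustration, not the paper's actual metric definitions.

```python
# Simplified aggregation of the two evaluation signals (assumed scales).
# Pairwise verdicts are in {"A", "Tie", "B"} with "A" = model response preferred;
# visual factuality is assumed to be scored per case on a 0-10 scale.
from statistics import mean

def pairwise_score(verdicts: list[str]) -> float:
    """Percentage of comparisons won, counting ties as half a win."""
    points = {"A": 1.0, "Tie": 0.5, "B": 0.0}
    return 100 * mean(points[v] for v in verdicts)

def visual_factuality(scores: list[float]) -> float:
    """Average visual factuality across test cases."""
    return mean(scores)

verdicts = ["A", "Tie", "B", "A"]
vfs = [8.0, 7.5, 9.0, 6.5]
print(f"Pairwise win rate: {pairwise_score(verdicts):.1f}")
print(f"Visual factuality: {visual_factuality(vfs):.2f}")
```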

Results

The comprehensive evaluation conducted using Creation-MMBench revealed several important findings. Current open-source MLLMs significantly underperform compared to advanced proprietary models like Gemini-2.0-Pro and GPT-4o in terms of context-aware creativity. This performance gap highlights the challenges that open-source models face in integrating visual information with creative expression.

Furthermore, the comparison between Creation-MMBench and its text-only variant (Creation-MMBench-TO) uncovered a negative effect of visual instruction tuning on the creative abilities of the base LLM. This suggests that multimodal adaptation introduces trade-offs, where gains in visual understanding may come at the expense of creative language generation.

The benchmark also demonstrated the importance of instance-specific evaluation criteria, as different creative tasks require different assessment approaches. The dual evaluation methodology (pairwise comparison and visual factuality scoring) proved effective in providing a nuanced understanding of MLLMs' creative capabilities.

Conclusion

Creation-MMBench specifically targets evaluating MLLMs on creative intelligence, an aspect that has been largely overlooked in previous benchmarks. By providing a diverse set of tasks across real-world scenarios and implementing a robust evaluation methodology, the benchmark offers valuable insights into the current limitations of MLLMs in context-aware creativity and vision-based language generation. For more information, please consult the full paper.

Congrats to the authors for their work!

Fang, Xinyu, et al. "Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLMs." arXiv preprint arXiv:2503.14478 (2025).