Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection with Artifact Explanation
Credit: https://arxiv.org/pdf/2503.14905

Today's paper introduces FakeVLM, a specialized large multimodal model designed for detecting synthetic images and explaining their artifacts. The model not only distinguishes between real and AI-generated images but also provides natural language explanations for its decisions, enhancing interpretability. Additionally, the authors present FakeClue, a comprehensive dataset containing over 100,000 images across seven categories with fine-grained artifact annotations.

Method Overview

FakeVLM is built upon the LLaVA architecture, consisting of three main components: a global image encoder (CLIP-ViT), a multi-modal projector (MLP), and a large language model (Vicuna-v1.5-7B). The model processes input images at a resolution of 336×336 to preserve synthetic artifact details, resulting in 576 patches per image.
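To make the three-part structure concrete, here is a minimal PyTorch sketch of a LLaVA-style pipeline: a vision encoder produces patch features, an MLP projector maps them into the language model's embedding space, and the LLM decodes over the combined sequence. The module names, dimensions, and forward pass are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal structural sketch of a LLaVA-style pipeline like the one FakeVLM builds on.
# Dimensions assume CLIP ViT-L/14 at 336px (336/14 = 24, 24^2 = 576 patch tokens).
import torch
import torch.nn as nn

class LlavaStyleDetector(nn.Module):
    def __init__(self, vision_encoder, language_model, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. CLIP ViT-L/14 @ 336x336
        # Two-layer MLP projector mapping patch features into the LLM embedding space
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.language_model = language_model      # e.g. Vicuna-v1.5-7B

    def forward(self, pixel_values, text_embeds):
        patch_features = self.vision_encoder(pixel_values)   # (B, 576, vision_dim)
        visual_tokens = self.projector(patch_features)        # (B, 576, llm_dim)
        # Visual tokens are prepended to the text embeddings and decoded by the LLM
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs)
```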

The approach differs from traditional synthetic image detection methods by framing the task as visual question answering rather than simple binary classification. Instead of just outputting "Real" or "Fake," FakeVLM provides detailed explanations of the artifacts it detects in synthetic images. This framing not only improves the model's performance in artifact explanation but also significantly enhances its overall synthetic image detection capabilities.
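Because the task is framed as visual question answering, inference looks like prompting any LLaVA-family model with an image and a question. The sketch below uses the public llava-1.5-7b-hf checkpoint from Hugging Face as a stand-in, since the summary does not name a released FakeVLM checkpoint; the image path and prompt wording are placeholders.

```python
# Hedged VQA-style inference sketch with a stand-in LLaVA-1.5 checkpoint.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # stand-in weights, not FakeVLM's
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

image = Image.open("suspect_image.png")  # placeholder path
prompt = "USER: <image>\nDoes the image look real/fake? Explain any artifacts you see. ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```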

To train the model, the authors created FakeClue, a comprehensive dataset containing over 100,000 real and synthetic images across seven categories (animals, humans, objects, scenery, satellite, document, and deepfake). The dataset was constructed through a multi-step process: data collection from open sources and self-synthesized datasets, pre-processing with categorization, label prompt design based on category knowledge, and annotation using multiple large multimodal models (LMMs). This multi-LMM annotation strategy helps mitigate bias and hallucination effects that might occur with a single model.
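The multi-LMM annotation step can be pictured as several models answering the same category-aware prompt, with only corroborated observations kept. The sketch below is an assumption-laden illustration of that idea: the prompt wording, model list, `query_lmm` helper, and majority-vote filter are placeholders, not the authors' exact pipeline.

```python
# Hedged sketch of multi-LMM annotation with a simple majority filter.
from collections import Counter

CATEGORY_PROMPTS = {
    "satellite": "Point out artifacts typical of synthetic satellite imagery.",
    "document":  "Point out artifacts typical of AI-generated documents.",
    # ... one prompt per FakeClue category, written from domain knowledge
}

def query_lmm(model_name: str, image_path: str, prompt: str) -> list[str]:
    """Placeholder: call an LMM API and return a list of short artifact claims."""
    raise NotImplementedError("plug in your own LMM client here")

def annotate(image_path: str, category: str, annotators: list[str]) -> list[str]:
    prompt = CATEGORY_PROMPTS.get(category, "Describe any synthesis artifacts you see.")
    claims = Counter()
    for name in annotators:
        claims.update(set(query_lmm(name, image_path, prompt)))
    # Keeping only claims made by a majority of annotators filters single-model hallucinations
    threshold = len(annotators) // 2 + 1
    return [claim for claim, count in claims.items() if count >= threshold]
```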

The training strategy involves fine-tuning all parameters of the LLaVA model using QA pairs from the constructed dataset. Each data sample consists of an image, a standardized prompt ("Does the image look real/fake?"), and an aggregated answer from the multi-model annotations. This full-parameter fine-tuning enables comprehensive adaptation to synthetic data reasoning while maintaining the model's original instruction-following capabilities.
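As a rough illustration of what one training sample might look like, the snippet below formats an image, the standardized question, and the aggregated annotation into the conversation layout used by the open-source LLaVA fine-tuning code. The field values and answer text are illustrative, not taken from FakeClue.

```python
# Hedged sketch of a FakeClue sample in LLaVA-style conversation format.
import json

def to_llava_sample(sample_id: str, image_file: str, aggregated_answer: str) -> dict:
    return {
        "id": sample_id,
        "image": image_file,
        "conversations": [
            # Standardized question used for every sample
            {"from": "human", "value": "<image>\nDoes the image look real/fake?"},
            # Aggregated multi-LMM annotation: verdict plus artifact explanation
            {"from": "gpt", "value": aggregated_answer},
        ],
    }

record = to_llava_sample(
    "fakeclue_000001",
    "animals/000001.png",
    "The image looks fake. The fur texture is overly smooth and the pupils are asymmetric.",
)
print(json.dumps(record, indent=2))
```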

Results

FakeVLM demonstrates superior performance compared to existing methods in both synthetic image detection and artifact explanation. On the FakeClue dataset, it achieves 98.6% accuracy and a 98.1% F1 score, significantly outperforming leading general-purpose large multimodal models such as Qwen2-VL-72B and GPT-4o. On the LOKI benchmark, FakeVLM achieves 84.3% accuracy, surpassing even human performance (80.1%).

In DeepFake detection tasks, FakeVLM outperforms specialized vision-language models like Common-DF, improving accuracy by 5.7%, F1 by 3%, and ROUGE_L (representing artifact explanation performance) by 9.5% on the DD-VQA dataset. On the FF++ dataset, FakeVLM maintains strong performance across multiple sub-categories such as DeepFakes, Face2Face, FaceSwap, and NeuralTextures.
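For context on the metrics above, here is a hedged sketch of how detection accuracy/F1 and ROUGE_L over generated explanations are commonly computed, using scikit-learn and the rouge-score package. It is a generic evaluation loop with assumed record fields, not the paper's evaluation code.

```python
# Hedged sketch: generic detection + explanation metrics, not the authors' script.
from sklearn.metrics import accuracy_score, f1_score
from rouge_score import rouge_scorer

def evaluate(records):
    """records: dicts with 'label', 'pred_label', 'reference', 'generated' (assumed schema)."""
    y_true = [r["label"] for r in records]        # 1 = fake, 0 = real
    y_pred = [r["pred_label"] for r in records]   # parsed from the model's answer
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = sum(
        scorer.score(r["reference"], r["generated"])["rougeL"].fmeasure for r in records
    ) / len(records)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "rougeL": rouge_l,  # proxy for artifact-explanation quality
    }
```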

The qualitative results show that FakeVLM accurately identifies potential issues arising from the synthesis process, such as visual artifacts, texture distortions, and structural anomalies, while providing detailed explanations in natural language. This significantly enhances the interpretability of the synthetic detection process and helps users make confident decisions when assessing synthetic content.

Conclusion

FakeVLM represents a significant advancement in synthetic image detection by integrating both detection and artifact explanation capabilities. Through an effective training strategy, the model leverages the potential of large multimodal models for synthetic detection without relying on expert classifiers. For more information, please consult the full paper.

Congrats to the authors for their work!

Wen, Siwei, et al. "Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection with Artifact Explanation." arXiv preprint arXiv:2503.14905 (2025).