MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Credit: https://arxiv.org/pdf/2409.20566

Today's paper introduces MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. The paper presents a comprehensive study on building performant MLLMs, focusing on data-centric approaches and training strategies. The authors demonstrate that careful data curation and training can yield strong performance even at small model scales.

Method Overview

The paper introduces MM1.5, a family of MLLMs that builds upon the MM1 architecture. The overall approach involves a three-stage training process: large-scale pre-training, high-resolution continual pre-training, and supervised fine-tuning (SFT).
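
To make the staged recipe easier to picture, here is a minimal sketch of how such a three-stage schedule could be written down as a configuration. The stage names, data categories, sampling weights, resolutions, and trainable components are illustrative assumptions, not values taken from the paper.

```python
# Hypothetical sketch of a three-stage MM1.5-style training schedule.
# All names, weights, and resolutions below are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    data_sources: dict       # data category -> sampling weight
    image_resolution: int    # input resolution for the vision encoder
    trainable: tuple         # which model components are updated


TRAINING_SCHEDULE = [
    Stage(
        name="pretraining",
        data_sources={"image_caption": 0.45, "interleaved_image_text": 0.45, "text_only": 0.10},
        image_resolution=336,
        trainable=("connector", "llm"),
    ),
    Stage(
        name="high_res_continual_pretraining",
        data_sources={"ocr": 0.6, "synthetic_captions": 0.4},
        image_resolution=1344,
        trainable=("vision_encoder", "connector", "llm"),
    ),
    Stage(
        name="supervised_fine_tuning",
        data_sources={"general": 0.4, "text_rich": 0.3, "refer_and_ground": 0.2, "multi_image": 0.1},
        image_resolution=1344,
        trainable=("vision_encoder", "connector", "llm"),
    ),
]

for stage in TRAINING_SCHEDULE:
    print(stage.name, "->", list(stage.data_sources))
```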

In the first stage, they conduct large-scale pre-training using a mix of image-caption, interleaved image-text, and text-only data. This stage is crucial for achieving state-of-the-art few-shot results across multiple benchmarks.

The second stage involves continual pre-training with high-resolution OCR data and synthetic captions. This stage is particularly important for boosting text-rich image understanding performance. They use carefully selected OCR data and high-quality synthetic image captions, either from public sources or generated using a previously trained MM1 model.

The final stage is supervised fine-tuning, where they use a carefully curated mixture of public datasets. They conduct extensive ablations to identify trade-offs and synergies between different data categories, ultimately constructing a mixture that contributes to well-balanced performance across a wide set of capabilities.
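
The SFT mixture can be thought of as weighted sampling over data categories, where the ablations determine the weights. Below is a small sketch of such a sampler; the category names and ratios are placeholders for illustration, not the mixture reported in the paper.

```python
import random

# Illustrative SFT mixture: data category -> sampling weight.
# These are placeholder values, not the paper's actual ratios.
SFT_MIXTURE = {
    "general_vqa": 0.35,
    "text_rich": 0.25,
    "refer_and_ground": 0.15,
    "multi_image": 0.15,
    "text_only": 0.10,
}


def sample_category(mixture, rng=random):
    """Draw one data category in proportion to its mixture weight."""
    categories = list(mixture)
    weights = [mixture[c] for c in categories]
    return rng.choices(categories, weights=weights, k=1)[0]


# Example: approximate the mixture empirically over 10,000 draws.
counts = {c: 0 for c in SFT_MIXTURE}
for _ in range(10_000):
    counts[sample_category(SFT_MIXTURE)] += 1
print(counts)
```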

Throughout the training process, they employ dynamic high-resolution image encoding, which involves dividing images into sub-images for more effective processing. They also incorporate coordinate tokens to enable visual referring and grounding capabilities.
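
Dynamic high-resolution encoding of this kind typically tiles a large image into a grid of fixed-size sub-images (plus a downscaled global view) before passing them to the vision encoder, while coordinate tokens express box coordinates as discrete text tokens. The sketch below, using Pillow, illustrates the general idea; the tile size, grid selection, and token format are simplifying assumptions, not the paper's exact implementation.

```python
from PIL import Image


def split_into_subimages(image: Image.Image, tile_size: int = 448, max_tiles: int = 9):
    """Tile a high-resolution image into fixed-size sub-images plus a global
    overview. The grid-selection heuristic here is a simplified assumption."""
    w, h = image.size
    # Choose a grid that roughly preserves aspect ratio without exceeding max_tiles.
    cols = max(1, round(w / tile_size))
    rows = max(1, round(h / tile_size))
    while cols * rows > max_tiles:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    resized = image.resize((cols * tile_size, rows * tile_size))
    tiles = [
        resized.crop((c * tile_size, r * tile_size, (c + 1) * tile_size, (r + 1) * tile_size))
        for r in range(rows)
        for c in range(cols)
    ]
    overview = image.resize((tile_size, tile_size))  # low-resolution global view
    return [overview] + tiles


def box_to_coordinate_tokens(box, image_size, bins: int = 1000):
    """Quantize a bounding box (x0, y0, x1, y1) in pixels into discrete
    coordinate tokens for referring/grounding targets. The [0, bins)
    quantization and token format are illustrative conventions."""
    w, h = image_size
    x0, y0, x1, y1 = box
    q = lambda v, s: min(bins - 1, int(v / s * bins))
    return f"<{q(x0, w)}><{q(y0, h)}><{q(x1, w)}><{q(y1, h)}>"
```

In schemes like this, each sub-image is encoded separately and the resulting visual tokens are concatenated before being fed to the language model, which lets the model see fine detail without retraining the encoder at very large resolutions.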

The paper presents both dense models (ranging from 1B to 30B parameters) and Mixture-of-Experts (MoE) variants. They also introduce specialized variants: MM1.5-Video for video understanding and MM1.5-UI for mobile UI understanding.

Results

The paper reports several key results:

  1. MM1.5 demonstrates strong performance across a wide range of multimodal tasks, from general-domain to text-rich image understanding, coarse- to fine-grained understanding, and single- to multi-image reasoning.
  2. Even at small scales (1B and 3B parameters), MM1.5 achieves competitive performance, outperforming larger open-source models on various downstream tasks.
  3. The MM1.5 recipe exhibits strong scaling behavior up to 30B parameters, achieving competitive performance across a wide range of benchmarks.
  4. The specialized variants, MM1.5-Video and MM1.5-UI, show promising results in their respective domains of video understanding and mobile UI comprehension.

Conclusion

The paper introduces MM1.5, a significant upgrade over its predecessor MM1, offering enhanced capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. For more information, please consult the full paper.

Congrats to the authors for their work!

Zhang, Haotian, et al. "MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning." arXiv preprint arXiv:2409.20566 (2024).
