MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Credit: https://arxiv.org/pdf/2409.20566

Today's paper introduces MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. The paper presents a comprehensive study on building performant MLLMs, focusing on data-centric approaches and training strategies. The authors demonstrate that careful data curation and training can yield strong performance even at small model scales.

Method Overview

The paper introduces MM1.5, a family of MLLMs that builds upon the MM1 architecture. The overall approach involves a three-stage training process: large-scale pre-training, high-resolution continual pre-training, and supervised fine-tuning (SFT).
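
To make the staged recipe easier to picture, here is a minimal sketch of how such a three-stage schedule could be written down as a configuration. The stage names, data categories, sampling weights, resolutions, and trainable components are illustrative assumptions, not values taken from the paper.

```python
# Hypothetical sketch of a three-stage MM1.5-style training schedule.
# All names, weights, and resolutions below are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    data_sources: dict       # data category -> sampling weight
    image_resolution: int    # input resolution for the vision encoder
    trainable: tuple         # which model components are updated


TRAINING_SCHEDULE = [
    Stage(
        name="pretraining",
        data_sources={"image_caption": 0.45, "interleaved_image_text": 0.45, "text_only": 0.10},
        image_resolution=336,
        trainable=("connector", "llm"),
    ),
    Stage(
        name="high_res_continual_pretraining",
        data_sources={"ocr": 0.6, "synthetic_captions": 0.4},
        image_resolution=1344,
        trainable=("vision_encoder", "connector", "llm"),
    ),
    Stage(
        name="supervised_fine_tuning",
        data_sources={"general": 0.4, "text_rich": 0.3, "refer_and_ground": 0.2, "multi_image": 0.1},
        image_resolution=1344,
        trainable=("vision_encoder", "connector", "llm"),
    ),
]

for stage in TRAINING_SCHEDULE:
    print(stage.name, "->", list(stage.data_sources))
```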

In the first stage, they conduct large-scale pre-training using a mix of image-caption, interleaved image-text, and text-only data. This stage is crucial for achieving state-of-the-art few-shot results across multiple benchmarks.

The second stage involves continual pre-training with high-resolution OCR data and synthetic captions. This stage is particularly important for boosting text-rich image understanding performance. They use carefully selected OCR data and high-quality synthetic image captions, either from public sources or generated using a previously trained MM1 model.

The final stage is supervised fine-tuning, where they use a carefully curated mixture of public datasets. They conduct extensive ablations to identify trade-offs and synergies between different data categories, ultimately constructing a mixture that contributes to well-balanced performance across a wide set of capabilities.
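
The SFT mixture can be thought of as weighted sampling over data categories, where the ablations determine the weights. Below is a small sketch of such a sampler; the category names and ratios are placeholders for illustration, not the mixture reported in the paper.

```python
import random

# Illustrative SFT mixture: data category -> sampling weight.
# These are placeholder values, not the paper's actual ratios.
SFT_MIXTURE = {
    "general_vqa": 0.35,
    "text_rich": 0.25,
    "refer_and_ground": 0.15,
    "multi_image": 0.15,
    "text_only": 0.10,
}


def sample_category(mixture, rng=random):
    """Draw one data category in proportion to its mixture weight."""
    categories = list(mixture)
    weights = [mixture[c] for c in categories]
    return rng.choices(categories, weights=weights, k=1)[0]


# Example: approximate the mixture empirically over 10,000 draws.
counts = {c: 0 for c in SFT_MIXTURE}
for _ in range(10_000):
    counts[sample_category(SFT_MIXTURE)] += 1
print(counts)
```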

Throughout the training process, they employ dynamic high-resolution image encoding, which involves dividing images into sub-images for more effective processing. They also incorporate coordinate tokens to enable visual referring and grounding capabilities.
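
Dynamic high-resolution encoding of this kind typically tiles a large image into a grid of fixed-size sub-images (plus a downscaled global view) before passing them to the vision encoder, while coordinate tokens express box coordinates as discrete text tokens. The sketch below, using Pillow, illustrates the general idea; the tile size, grid selection, and token format are simplifying assumptions, not the paper's exact implementation.

```python
from PIL import Image


def split_into_subimages(image: Image.Image, tile_size: int = 448, max_tiles: int = 9):
    """Tile a high-resolution image into fixed-size sub-images plus a global
    overview. The grid-selection heuristic here is a simplified assumption."""
    w, h = image.size
    # Choose a grid that roughly preserves aspect ratio without exceeding max_tiles.
    cols = max(1, round(w / tile_size))
    rows = max(1, round(h / tile_size))
    while cols * rows > max_tiles:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    resized = image.resize((cols * tile_size, rows * tile_size))
    tiles = [
        resized.crop((c * tile_size, r * tile_size, (c + 1) * tile_size, (r + 1) * tile_size))
        for r in range(rows)
        for c in range(cols)
    ]
    overview = image.resize((tile_size, tile_size))  # low-resolution global view
    return [overview] + tiles


def box_to_coordinate_tokens(box, image_size, bins: int = 1000):
    """Quantize a bounding box (x0, y0, x1, y1) in pixels into discrete
    coordinate tokens for referring/grounding targets. The [0, bins)
    quantization and token format are illustrative conventions."""
    w, h = image_size
    x0, y0, x1, y1 = box
    q = lambda v, s: min(bins - 1, int(v / s * bins))
    return f"<{q(x0, w)}><{q(y0, h)}><{q(x1, w)}><{q(y1, h)}>"
```

In schemes like this, each sub-image is encoded separately and the resulting visual tokens are concatenated before being fed to the language model, which lets the model see fine detail without retraining the encoder at very large resolutions.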

The paper presents both dense models (ranging from 1B to 30B parameters) and Mixture-of-Experts (MoE) variants. They also introduce specialized variants: MM1.5-Video for video understanding and MM1.5-UI for mobile UI understanding.

Results

The paper reports several key results:

  1. MM1.5 demonstrates strong performance across a wide range of multimodal tasks, from general-domain to text-rich image understanding, coarse- to fine-grained understanding, and single- to multi-image reasoning.
  2. Even at small scales (1B and 3B parameters), MM1.5 achieves competitive performance, outperforming larger open-source models on various downstream tasks.
  3. The MM1.5 recipe exhibits strong scaling behavior up to 30B parameters, achieving competitive performance across a wide range of benchmarks.
  4. The specialized variants, MM1.5-Video and MM1.5-UI, show promising results in their respective domains of video understanding and mobile UI comprehension.

Conclusion

The paper introduces MM1.5, a significant upgrade over its predecessor MM1, offering enhanced capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. For more information, please consult the full paper.

Congrats to the authors for their work!

Zhang, Haotian, et al. "MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning." arXiv preprint arXiv:2409.20566 (2024).
