The development of open-source multi-modal models has recently gained momentum, with two notable contributions being the Molmo models from the Allen Institute for AI (Ai2) and the Llama 3.2 models from Meta. Both families process text and images, supporting applications such as document understanding, image captioning, and visual reasoning.
Molmo Models
The Molmo family, introduced by Ai2, consists of four models: MolmoE-1B, Molmo-7B-O, Molmo-7B-D, and Molmo-72B. These models are built on a novel architecture that combines a pre-processor for multi-scale, multi-crop image processing, a ViT image encoder (OpenAI’s ViT-L/14 336px CLIP), a connector MLP for vision-language projection and pooling, and a decoder-only Transformer language model.
- Molmo-72B: The flagship model, based on Alibaba Cloud’s Qwen2-72B open-source model, has demonstrated exceptional performance on various benchmarks. It scores 96.3 on DocVQA and 85.5 on TextVQA, outperforming both Gemini 1.5 Pro and Claude 3.5 Sonnet in these categories.
- Molmo-7B Models: The Molmo-7B-O and Molmo-7B-D models, built on the fully open OLMo-7B-1024 LLM and the open-weight Qwen2 7B LLM respectively, offer a balance between performance and accessibility, performing comfortably between GPT-4V and GPT-4o across a range of benchmarks.
- MolmoE-1B: The most efficient model, based on the OLMoE-1B-7B mixture-of-experts LLM, nearly matches GPT-4V on both academic benchmarks and human preference evaluations.
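To make the pipeline described above more concrete, the sketch below wires together the same four stages (multi-crop pre-processing output, a ViT image encoder, a pooling-plus-MLP connector, and a decoder-only language model) in plain PyTorch. It is an illustrative simplification, not Molmo’s actual implementation: the dimensions, the pooling scheme, and the interfaces assumed for the `vit` and `language_model` arguments are placeholder assumptions.

```python
import torch
import torch.nn as nn

class ToyMolmoStylePipeline(nn.Module):
    """Illustrative sketch of a Molmo-style VLM pipeline (not the real Molmo code)."""

    def __init__(self, vit, language_model, vit_dim=1024, lm_dim=4096, pool=2):
        super().__init__()
        self.vit = vit                        # ViT image encoder (e.g. a CLIP ViT-L/14)
        self.language_model = language_model  # decoder-only Transformer LM
        self.pool = nn.AvgPool1d(pool)        # placeholder for the connector's token pooling
        self.connector = nn.Sequential(       # MLP projecting vision features into LM space
            nn.Linear(vit_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, image_crops, text_embeddings):
        # image_crops: (num_crops, 3, H, W) from a multi-scale, multi-crop pre-processor
        patch_feats = self.vit(image_crops)               # (num_crops, num_patches, vit_dim)
        patch_feats = patch_feats.flatten(0, 1)           # merge crops: (tokens, vit_dim)
        pooled = self.pool(patch_feats.t().unsqueeze(0))  # reduce the number of vision tokens
        pooled = pooled.squeeze(0).t()                    # (tokens / pool, vit_dim)
        vision_tokens = self.connector(pooled)            # project into the LM embedding space
        # Prepend vision tokens to the text embeddings and let the decoder attend over both.
        lm_input = torch.cat([vision_tokens.unsqueeze(0), text_embeddings], dim=1)
        return self.language_model(inputs_embeds=lm_input)
```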
Technical Insights
- Training Approach: Molmo uses a two-stage training approach: caption generation pre-training followed by supervised fine-tuning on a diverse mixture of datasets. This includes standard academic benchmarks and newly created datasets that enable the models to handle complex real-world tasks like document reading, visual reasoning, and even pointing.
- Dataset: The key innovation behind Molmo’s success is the PixMo-Cap dataset, a novel collection of highly detailed image captions gathered from human speech-based descriptions. This dataset comprises 712,000 images with approximately 1.3 million captions.
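If the caption data is published on the Hugging Face Hub, inspecting it is a one-liner with the datasets library. The repository id and column names below are assumptions for illustration only; check Ai2’s release for the exact identifiers.

```python
from datasets import load_dataset

# Repository id and column names are assumptions; adjust to match Ai2's actual release.
pixmo_cap = load_dataset("allenai/pixmo-cap", split="train")

example = pixmo_cap[0]
print(example.keys())                          # e.g. an image URL plus one or more dense captions
print(example.get("caption", example))         # fall back to the raw row if the field name differs
```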
Performance Highlights
- Molmo-72B: Achieves the highest average score (81.2%) across 11 academic benchmarks and ranks second in human preference evaluations, just behind GPT-4o.
- MolmoE-1B: Nearly matches GPT-4V on both academic benchmarks and human preference evaluations.
- Molmo-7B Models: Perform comfortably between GPT-4V and GPT-4o on both academic benchmarks and user preference.
Open-Source and Accessibility
- Release Plan: Ai2 plans to release all model weights, the captioning and fine-tuning data, and source code in the near future; select model weights, inference code, and a demo are already available.
Llama 3.2 Models
Meta’s Llama 3.2 models expand the capabilities of large language models (LLMs) with multimodal support. The collection ranges from lightweight, text-only 1B and 3B parameter models to 11B and 90B parameter vision models capable of sophisticated reasoning tasks over high-resolution images.
- Llama 3.2 1B and 3B: Lightweight text-only models suited to edge devices and mobile applications, covering tasks such as personal information management, multilingual knowledge retrieval, text summarisation, classification, and language translation.
- Llama 3.2 11B and 90B: Medium-sized models that support multimodal input, including high-resolution images up to 1120x1120 pixels, enabling tasks like document-level understanding, interpretation of charts and graphs, and image captioning.
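As a concrete example, the snippet below shows one common way to run the 11B Vision model for image question answering with Hugging Face transformers (version 4.45 or later, which introduced the Mllama classes). The image URL and prompt are placeholders, and the gated meta-llama weights require accepting Meta’s licence before download.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # gated repo; licence acceptance required
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder chart image; swap in your own document or figure.
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Summarise the main trend shown in this chart."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```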
Performance Highlights
- Llama 3.2 90B-Vision: Matches OpenAI’s GPT-4o on chart understanding (ChartQA) and outperforms Anthropic’s Claude 3 Opus and Google’s Gemini 1.5 Pro on interpreting scientific diagrams (AI2D).
- Llama 3.2 11B-Vision: Beats Gemini 1.5 Flash 8B on document visual Q&A (DocVQA), tops Claude 3 Haiku and Claude 3 Sonnet on AI2D, ChartQA, and visual mathematical reasoning (MathVista), and keeps pace with Pixtral 12B and Qwen2-VL 7B on general visual Q&A (VQAv2).
- Llama 3.2 3B: Matches the larger Llama 3.1 8B on tool use (BFCL v2) and exceeds it on summarisation (TLDR9+), with the 1B model rivaling both on summarisation and rewriting tasks.
Technical Insights
- Training Approach: Llama 3.2 models use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to adapt the models to follow instructions and generate more relevant responses.
- Multimodal Capabilities: The 11B and 90B Vision models integrate image encoder representations into the language model, enabling tasks that involve both visual and textual data.
- Efficiency: All models support grouped-query attention (GQA), which improves inference speed and memory efficiency, a benefit that is particularly noticeable for the larger 90B model.
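The idea behind GQA is that several query heads share a single key/value head, which shrinks the KV cache and speeds up decoding without giving up multi-head queries entirely. The toy attention function below illustrates the grouping; the dimensions are arbitrary, and it omits masking, caching, and rotary embeddings.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """Toy GQA: each group of query heads attends using a shared key/value head.

    q: (batch, seq, n_q_heads, head_dim)
    k, v: (batch, seq, n_kv_heads, head_dim) with n_kv_heads < n_q_heads
    """
    group_size = q.size(2) // k.size(2)
    # Repeat each KV head so it lines up with its group of query heads.
    k = k.repeat_interleave(group_size, dim=2)          # -> (batch, seq, n_q_heads, head_dim)
    v = v.repeat_interleave(group_size, dim=2)

    q, k, v = (t.transpose(1, 2) for t in (q, k, v))    # (batch, heads, seq, head_dim)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    weights = F.softmax(scores, dim=-1)
    return (weights @ v).transpose(1, 2)                # back to (batch, seq, heads, head_dim)

# The memory saving comes from caching only n_kv_heads worth of K/V during decoding.
b, s, d = 2, 16, 64
out = grouped_query_attention(
    torch.randn(b, s, 32, d), torch.randn(b, s, 8, d), torch.randn(b, s, 8, d)
)
print(out.shape)  # torch.Size([2, 16, 32, 64])
```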
Open-Source and Accessibility
- Availability: Llama 3.2 models are available on various platforms, including Amazon Bedrock, Databricks, and IBM’s watsonx.ai, facilitating access and integration for developers.
- Customisation: The open-source nature of Llama 3.2 allows for fine-tuning and customisation, enabling developers to create tailored solutions for specific use cases.
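A common low-cost route to such customisation is parameter-efficient fine-tuning, for example LoRA via the peft library. The sketch below shows the general pattern; the target module names and hyperparameters are illustrative assumptions that usually need adjusting per model, and it loads the smaller 1B text model to keep the example lightweight.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-3.2-1B"  # gated repo; licence acceptance required
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# LoRA hyperparameters and target modules are illustrative defaults, not a tuned recipe.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapter weights are trained

# From here, train on your own dataset with a standard Trainer or SFT loop.
```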
Conclusion
The Molmo and Llama 3.2 models represent a notable development in the field of open-source multi-modal AI. Their performance and accessibility offer a competitive alternative to proprietary models, potentially democratising access to advanced AI capabilities and fostering innovation in various applications.
If you found this article informative and valuable, consider sharing it with your network to help others discover the power of AI.