The development of open-source multi-modal models has recently gained momentum, with two notable contributions being the Molmo models from the Allen Institute for AI (Ai2) and the Llama 3.2 models from Meta. Both families process text and images, supporting applications such as document understanding, image captioning, and visual reasoning.
Molmo Models
The Molmo family, introduced by Ai2, consists of four models: MolmoE-1B, Molmo-7B-O, Molmo-7B-D, and Molmo-72B. These models are built on a novel architecture that combines a pre-processor for multi-scale, multi-crop image processing, a ViT image encoder (OpenAI’s ViT-L/14 336px CLIP), a connector MLP for vision-language projection and pooling, and a decoder-only Transformer language model.
- Molmo-72B: The flagship model, based on Alibaba Cloud’s Qwen2-72B open-source model, has demonstrated exceptional performance on various benchmarks. It scores 96.3 on DocVQA and 85.5 on TextVQA, outperforming both Gemini 1.5 Pro and Claude 3.5 Sonnet in these categories.
- Molmo-7B Models: The Molmo-7B-O and Molmo-7B-D models, built on the fully open OLMo-7B-1024 LLM and the open-weight Qwen2 7B LLM respectively, offer a balance between performance and accessibility, performing comfortably between GPT-4V and GPT-4o across a range of benchmarks.
- MolmoE-1B: The most efficient model, based on the OLMoE-1B-7B mixture-of-experts LLM, nearly matches GPT-4V on both academic benchmarks and human preference evaluations.
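To make the pipeline described above more concrete, the sketch below wires together the same four stages (multi-crop pre-processing output, a ViT image encoder, a pooling-plus-MLP connector, and a decoder-only language model) in plain PyTorch. It is an illustrative simplification, not Molmo’s actual implementation: the dimensions, the pooling scheme, and the interfaces assumed for the `vit` and `language_model` arguments are placeholder assumptions.

```python
import torch
import torch.nn as nn

class ToyMolmoStylePipeline(nn.Module):
    """Illustrative sketch of a Molmo-style VLM pipeline (not the real Molmo code)."""

    def __init__(self, vit, language_model, vit_dim=1024, lm_dim=4096, pool=2):
        super().__init__()
        self.vit = vit                        # ViT image encoder (e.g. a CLIP ViT-L/14)
        self.language_model = language_model  # decoder-only Transformer LM
        self.pool = nn.AvgPool1d(pool)        # placeholder for the connector's token pooling
        self.connector = nn.Sequential(       # MLP projecting vision features into LM space
            nn.Linear(vit_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, image_crops, text_embeddings):
        # image_crops: (num_crops, 3, H, W) from a multi-scale, multi-crop pre-processor
        patch_feats = self.vit(image_crops)               # (num_crops, num_patches, vit_dim)
        patch_feats = patch_feats.flatten(0, 1)           # merge crops: (tokens, vit_dim)
        pooled = self.pool(patch_feats.t().unsqueeze(0))  # reduce the number of vision tokens
        pooled = pooled.squeeze(0).t()                    # (tokens / pool, vit_dim)
        vision_tokens = self.connector(pooled)            # project into the LM embedding space
        # Prepend vision tokens to the text embeddings and let the decoder attend over both.
        lm_input = torch.cat([vision_tokens.unsqueeze(0), text_embeddings], dim=1)
        return self.language_model(inputs_embeds=lm_input)
```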
Technical Insights
- Training Approach: Molmo uses a two-stage training approach: caption generation pre-training followed by supervised fine-tuning on a diverse mixture of datasets. This includes standard academic benchmarks and newly created datasets that enable the models to handle complex real-world tasks like document reading, visual reasoning, and even pointing.
- Dataset: The key innovation behind Molmo’s success is the PixMo-Cap dataset, a novel collection of highly detailed image captions gathered from human speech-based descriptions. This dataset comprises 712,000 images with approximately 1.3 million captions.
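If the caption data is published on the Hugging Face Hub, inspecting it is a one-liner with the datasets library. The repository id and column names below are assumptions for illustration only; check Ai2’s release for the exact identifiers.

```python
from datasets import load_dataset

# Repository id and column names are assumptions; adjust to match Ai2's actual release.
pixmo_cap = load_dataset("allenai/pixmo-cap", split="train")

example = pixmo_cap[0]
print(example.keys())                          # e.g. an image URL plus one or more dense captions
print(example.get("caption", example))         # fall back to the raw row if the field name differs
```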
Performance Highlights
- Molmo-72B: Achieves the highest average score (81.2%) across 11 academic benchmarks and ranks second in human preference evaluations, just behind GPT-4o.
- MolmoE-1B: Nearly matches GPT-4V on both academic benchmarks and human preference evaluations.
- Molmo-7B Models: Perform comfortably between GPT-4V and GPT-4o on both academic benchmarks and user preference.
Open-Source and Accessibility
- Release Plan: Ai2 plans to release all model weights, the captioning and fine-tuning data, and source code in the near future; select model weights, inference code, and a demo are already available.
Llama 3.2 Models
Meta’s Llama 3.2 models expand the capabilities of large language models (LLMs) with multimodal support. The collection ranges from lightweight, text-only 1B and 3B parameter models to 11B and 90B parameter vision models capable of sophisticated reasoning tasks over high-resolution images.
- Llama 3.2 1B and 3B: Lightweight text-only models suited to edge devices and mobile applications, covering tasks such as personal information management, multilingual knowledge retrieval, text summarisation, classification, and language translation.
- Llama 3.2 11B and 90B: Medium-sized models that support multimodal input, including high-resolution images up to 1120x1120 pixels, enabling tasks like document-level understanding, interpretation of charts and graphs, and image captioning.
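As a concrete example, the snippet below shows one common way to run the 11B Vision model for image question answering with Hugging Face transformers (version 4.45 or later, which introduced the Mllama classes). The image URL and prompt are placeholders, and the gated meta-llama weights require accepting Meta’s licence before download.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # gated repo; licence acceptance required
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder chart image; swap in your own document or figure.
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Summarise the main trend shown in this chart."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```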
Performance Highlights
- Llama 3.2 90B-Vision: Matches OpenAI’s GPT-4o on chart understanding (ChartQA) and outperforms Anthropic’s Claude 3 Opus and Google’s Gemini 1.5 Pro on interpreting scientific diagrams (AI2D).
- Llama 3.2 11B-Vision: Beats Gemini 1.5 Flash 8B on document visual Q&A (DocVQA), tops Claude 3 Haiku and Claude 3 Sonnet on AI2D, ChartQA, and visual mathematical reasoning (MathVista), and keeps pace with Pixtral 12B and Qwen2-VL 7B on general visual Q&A (VQAv2).
- Llama 3.2 3B: Matches the larger Llama 3.1 8B on tool use (BFCL v2) and exceeds it on summarisation (TLDR9+), with the 1B model rivaling both on summarisation and rewriting tasks.
Technical Insights
- Training Approach: Llama 3.2 models use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to adapt the models to follow instructions and generate more relevant responses.
- Multimodal Capabilities: The 11B and 90B Vision models integrate image encoder representations into the language model, enabling tasks that involve both visual and textual data.
- Efficiency: All models support grouped-query attention (GQA), which improves inference speed and memory efficiency, a benefit that is particularly noticeable for the larger 90B model.
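The idea behind GQA is that several query heads share a single key/value head, which shrinks the KV cache and speeds up decoding without giving up multi-head queries entirely. The toy attention function below illustrates the grouping; the dimensions are arbitrary, and it omits masking, caching, and rotary embeddings.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """Toy GQA: each group of query heads attends using a shared key/value head.

    q: (batch, seq, n_q_heads, head_dim)
    k, v: (batch, seq, n_kv_heads, head_dim) with n_kv_heads < n_q_heads
    """
    group_size = q.size(2) // k.size(2)
    # Repeat each KV head so it lines up with its group of query heads.
    k = k.repeat_interleave(group_size, dim=2)          # -> (batch, seq, n_q_heads, head_dim)
    v = v.repeat_interleave(group_size, dim=2)

    q, k, v = (t.transpose(1, 2) for t in (q, k, v))    # (batch, heads, seq, head_dim)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    weights = F.softmax(scores, dim=-1)
    return (weights @ v).transpose(1, 2)                # back to (batch, seq, heads, head_dim)

# The memory saving comes from caching only n_kv_heads worth of K/V during decoding.
b, s, d = 2, 16, 64
out = grouped_query_attention(
    torch.randn(b, s, 32, d), torch.randn(b, s, 8, d), torch.randn(b, s, 8, d)
)
print(out.shape)  # torch.Size([2, 16, 32, 64])
```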
Open-Source and Accessibility
- Availability: Llama 3.2 models are available on various platforms, including Amazon Bedrock, Databricks, and IBM’s watsonx.ai, facilitating access and integration for developers.
- Customisation: The open-source nature of Llama 3.2 allows for fine-tuning and customisation, enabling developers to create tailored solutions for specific use cases.
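A common low-cost route to such customisation is parameter-efficient fine-tuning, for example LoRA via the peft library. The sketch below shows the general pattern; the target module names and hyperparameters are illustrative assumptions that usually need adjusting per model, and it loads the smaller 1B text model to keep the example lightweight.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-3.2-1B"  # gated repo; licence acceptance required
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# LoRA hyperparameters and target modules are illustrative defaults, not a tuned recipe.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapter weights are trained

# From here, train on your own dataset with a standard Trainer or SFT loop.
```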
Conclusion
The Molmo and Llama 3.2 models represent a notable development in the field of open-source multi-modal AI. Their performance and accessibility offer a competitive alternative to proprietary models, potentially democratising access to advanced AI capabilities and fostering innovation in various applications.
If you found this article informative and valuable, consider sharing it with your network to help others discover the power of AI.