Pixtral 12B
Today's paper introduces Pixtral 12B, a 12-billion-parameter multimodal language model that understands both images and text. The model achieves leading performance on a range of multimodal benchmarks, surpassing many larger models while maintaining strong text-only reasoning capabilities. Pixtral 12B uses a new vision encoder that lets it process images at their native resolution and aspect ratio.
Method Overview
Pixtral 12B is a multimodal language model trained to comprehend both images and text. The model is pretrained on large-scale interleaved image and text documents and then instruction-tuned, enabling it to engage in multi-turn, multi-image conversations, as sketched below.
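To make "multi-turn, multi-image" concrete, here is a minimal sketch of what such a conversation could look like as a list of chat messages. The message schema, role names, and image fields are assumptions for illustration; they are not Pixtral's actual chat template or any specific API.

```python
# Hypothetical multi-turn, multi-image conversation structure (illustrative only;
# field names are assumptions, not Pixtral's actual chat format).
conversation = [
    {"role": "user", "content": [
        {"type": "image", "source": "invoice_page1.png"},
        {"type": "image", "source": "invoice_page2.png"},
        {"type": "text", "text": "What is the total amount across both pages?"},
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "The combined total is $1,420.50."},
    ]},
    {"role": "user", "content": [
        {"type": "image", "source": "receipt.jpg"},
        {"type": "text", "text": "Does this receipt match either invoice?"},
    ]},
]
```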
A key feature of Pixtral 12B is its new vision encoder, which uses a novel ROPE-2D (two-dimensional rotary position encoding) implementation. This allows the model to process images at their native resolution and aspect ratio, giving users flexibility in image processing: images can be handled at low resolution in latency-constrained settings or at high resolution when fine-grained reasoning is required.
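As a rough illustration of the idea, here is a minimal sketch of one common ROPE-2D construction, in which half of each feature vector is rotated according to a patch's row index and the other half according to its column index. The specific split, base frequency, and function names are assumptions for illustration, not the paper's exact implementation.

```python
# A minimal sketch of 2D rotary position embeddings for image patch tokens
# (assumed construction: one half of the features rotated by row index,
# the other half by column index; base frequency 10000 is an assumption).
import torch

def rope_2d(x: torch.Tensor, rows: torch.Tensor, cols: torch.Tensor,
            base: float = 10000.0) -> torch.Tensor:
    """Apply 2D rotary embeddings to patch features.

    x:    (num_patches, dim) query/key features, dim divisible by 4
    rows: (num_patches,) row index of each patch in the image grid
    cols: (num_patches,) column index of each patch
    """
    dim = x.shape[-1]
    half = dim // 2                                      # features per axis
    # Standard 1D RoPE frequencies, one per rotated pair of dimensions
    freqs = 1.0 / (base ** (torch.arange(0, half, 2).float() / half))

    def rotate(feats, pos):
        angles = pos[:, None].float() * freqs[None, :]   # (N, half/2)
        cos, sin = angles.cos(), angles.sin()
        f1, f2 = feats[..., 0::2], feats[..., 1::2]      # paired dims
        out = torch.empty_like(feats)
        out[..., 0::2] = f1 * cos - f2 * sin
        out[..., 1::2] = f1 * sin + f2 * cos
        return out

    x_row, x_col = x[..., :half], x[..., half:]
    return torch.cat([rotate(x_row, rows), rotate(x_col, cols)], dim=-1)

# Patch positions for a 3x4 grid (native aspect ratio preserved)
rows, cols = torch.meshgrid(torch.arange(3), torch.arange(4), indexing="ij")
x = torch.randn(12, 64)
x_rot = rope_2d(x, rows.flatten(), cols.flatten())
```

Because the rotation depends only on each patch's (row, column) position, the same encoder can attend over grids of arbitrary height and width, which is what makes variable resolutions and aspect ratios possible.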
The model is designed to handle any number of images within its long context window of 128K tokens, making it suitable for complex multimodal tasks. Despite its focus on multimodal capabilities, Pixtral 12B maintains strong performance on text-only tasks, matching or exceeding the performance of comparable models across various benchmarks.
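To see how image size interacts with that context budget, here is a back-of-the-envelope helper. It assumes the encoder tokenizes images into 16x16 patches at native resolution with one token per patch; the patch size and the absence of any extra separator tokens are assumptions made for illustration.

```python
# Rough estimate of how many context tokens an image consumes, assuming
# 16x16 patches at native resolution and one token per patch (both assumptions).
def image_token_count(width: int, height: int, patch: int = 16) -> int:
    cols = -(-width // patch)   # ceiling division
    rows = -(-height // patch)
    return rows * cols

print(image_token_count(512, 512))    # 1024 tokens
print(image_token_count(1024, 768))   # 3072 tokens
```

Under these assumptions a 512x512 image costs about 1K tokens while a 1024x1024 image costs about 4K, which is why the choice between low- and high-resolution processing matters in latency- or context-constrained settings.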
Results
Pixtral 12B demonstrates strong performance across a range of multimodal benchmarks, outperforming open models of comparable size and even some substantially larger models, while remaining competitive on text-only evaluations.
Conclusion
Pixtral 12B represents a significant advancement in multimodal language models, offering strong performance on both image-text and text-only tasks while maintaining a relatively small model size. For more information, please consult the full paper.
Congrats to the authors for their work!
"Pixtral 12B." arXiv:2410.07073v1 [cs.CV], 9 Oct. 2024, arxiv.org/abs/2410.07073.