Pixtral 12B

Today's paper introduces Pixtral 12B, a 12-billion-parameter multimodal language model that understands both images and text. It achieves leading performance on a range of multimodal benchmarks, surpassing many larger models while maintaining strong text-only reasoning capabilities. Its vision encoder processes images at their native resolution and aspect ratio, giving users flexibility in how images are handled.

Method Overview

Pixtral 12B is a multimodal language model trained to comprehend both images and text. It is pretrained on large-scale interleaved image and text documents and then instruction-tuned, enabling it to engage in multi-turn, multi-image conversations.

A key component of Pixtral 12B is its new vision encoder, trained with a novel ROPE-2D (two-dimensional rotary position embedding) implementation. This lets the model process images at their native resolution and aspect ratio rather than forcing a fixed input size. Users can process images at low resolution in latency-constrained settings, or at high resolution when fine-grained reasoning is required.
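
To make the positional scheme concrete, here is a minimal PyTorch sketch of the general ROPE-2D idea: each image patch gets a (row, column) coordinate, one half of its feature vector is rotated by the row index, and the other half by the column index. This is an illustrative reconstruction under those assumptions, not the paper's exact implementation; the function names and dimensions are made up for the example.

```python
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    """Standard 1D rotary embedding applied along the last dim of x.

    x:   (n_tokens, dim) with dim even
    pos: (n_tokens,) integer positions
    """
    dim = x.shape[-1]
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = pos.float()[:, None] * freqs[None, :]   # (n_tokens, dim // 2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]              # even / odd feature pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin             # rotate each pair by its angle
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x: torch.Tensor, rows: torch.Tensor, cols: torch.Tensor) -> torch.Tensor:
    """2D variant: rotate one half of the features by the patch's row
    index and the other half by its column index."""
    half = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :half], rows),
                      rope_1d(x[..., half:], cols)], dim=-1)

# Patch grid for, e.g., a 48x80 image with 16x16 patches: 3 rows x 5 cols.
h_patches, w_patches = 3, 5
rows = torch.arange(h_patches).repeat_interleave(w_patches)  # 0,0,0,0,0,1,1,...
cols = torch.arange(w_patches).repeat(h_patches)             # 0,1,2,3,4,0,1,...

q = torch.randn(h_patches * w_patches, 64)  # toy per-patch query vectors
print(rope_2d(q, rows, cols).shape)         # torch.Size([15, 64])
```

Because positions are expressed as (row, column) pairs rather than a single flattened index, the same encoder handles any grid shape, which is what makes variable resolutions and aspect ratios possible.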

The model is designed to handle any number of images within its long context window of 128K tokens, making it suitable for complex multimodal tasks. Despite its focus on multimodal capabilities, Pixtral 12B maintains strong performance on text-only tasks, matching or exceeding the performance of comparable models across various benchmarks.
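
For a rough sense of that context budget, the sketch below estimates how many patch tokens one image consumes, assuming 16x16 patches (the patch size reported for Pixtral's encoder) and ignoring any special separator tokens the real tokenizer inserts; treat it as back-of-the-envelope arithmetic, not exact tokenizer behavior.

```python
import math

def approx_image_tokens(height: int, width: int, patch: int = 16) -> int:
    """Rough patch-token count for one image at native resolution.

    Assumes a 16x16 patch grid; ignores any special break/end tokens
    the actual Pixtral tokenizer may add.
    """
    return math.ceil(height / patch) * math.ceil(width / patch)

# Resolution trades tokens (and latency) for detail:
# halving each side cuts the token count by 4x.
print(approx_image_tokens(1024, 1024))  # 4096 tokens at full resolution
print(approx_image_tokens(512, 512))    # 1024 tokens at quarter area

# A 128K context leaves room for dozens of full-resolution images,
# before accounting for text tokens.
print(128_000 // approx_image_tokens(1024, 1024))  # 31
```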

Results

Pixtral 12B demonstrates impressive performance across various multimodal benchmarks:

  1. It outperforms other open models of similar size, such as Llama-3.2 11B and Qwen2-VL 7B.
  2. The model surpasses much larger open models like Llama-3.2 90B despite being roughly 7 times smaller.
  3. Pixtral 12B excels in multimodal instruction following, ranking highest among Apache 2.0 models on the LMSys Vision Leaderboard.
  4. It outperforms several closed models, including Claude-3 Haiku and Gemini-1.5 Flash 8B, on multimodal benchmarks.
  5. The model maintains strong performance on text-only tasks, matching or exceeding other models on benchmarks like MATH and HumanEval.

Conclusion

Pixtral 12B represents a significant advance in multimodal language models, offering strong performance on both image-text and text-only tasks at a relatively small model size. For more information, please consult the full paper.

Congrats to the authors for their work!

"Pixtral 12B." arXiv:2410.07073v1 [cs.CV], 9 Oct. 2024, arxiv.org/abs/2410.07073.
