Pixtral 12B
Today's paper introduces Pixtral 12B, a 12-billion-parameter multimodal language model that understands both images and text. The model achieves leading performance on a range of multimodal benchmarks, surpassing many larger models while maintaining strong text-only reasoning capabilities. Pixtral 12B uses a new vision encoder that lets it process images at their native resolution and aspect ratio.
Method Overview
Pixtral 12B is a multimodal language model trained to comprehend both images and text. The model is pretrained on large-scale interleaved image and text documents and then instruction-tuned, enabling it to engage in multi-turn, multi-image conversations, as sketched below.
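To make "multi-turn, multi-image" concrete, here is a minimal sketch of what such a conversation could look like as a list of chat messages. The message schema, role names, and image fields are assumptions for illustration; they are not Pixtral's actual chat template or any specific API.

```python
# Hypothetical multi-turn, multi-image conversation structure (illustrative only;
# field names are assumptions, not Pixtral's actual chat format).
conversation = [
    {"role": "user", "content": [
        {"type": "image", "source": "invoice_page1.png"},
        {"type": "image", "source": "invoice_page2.png"},
        {"type": "text", "text": "What is the total amount across both pages?"},
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "The combined total is $1,420.50."},
    ]},
    {"role": "user", "content": [
        {"type": "image", "source": "receipt.jpg"},
        {"type": "text", "text": "Does this receipt match either invoice?"},
    ]},
]
```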
A key feature of Pixtral 12B is its new vision encoder, which uses a novel ROPE-2D (two-dimensional rotary position encoding) implementation. This allows the model to process images at their native resolution and aspect ratio, giving users flexibility in image processing: images can be handled at low resolution in latency-constrained settings or at high resolution when fine-grained reasoning is required.
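As a rough illustration of the idea, here is a minimal sketch of one common ROPE-2D construction, in which half of each feature vector is rotated according to a patch's row index and the other half according to its column index. The specific split, base frequency, and function names are assumptions for illustration, not the paper's exact implementation.

```python
# A minimal sketch of 2D rotary position embeddings for image patch tokens
# (assumed construction: one half of the features rotated by row index,
# the other half by column index; base frequency 10000 is an assumption).
import torch

def rope_2d(x: torch.Tensor, rows: torch.Tensor, cols: torch.Tensor,
            base: float = 10000.0) -> torch.Tensor:
    """Apply 2D rotary embeddings to patch features.

    x:    (num_patches, dim) query/key features, dim divisible by 4
    rows: (num_patches,) row index of each patch in the image grid
    cols: (num_patches,) column index of each patch
    """
    dim = x.shape[-1]
    half = dim // 2                                      # features per axis
    # Standard 1D RoPE frequencies, one per rotated pair of dimensions
    freqs = 1.0 / (base ** (torch.arange(0, half, 2).float() / half))

    def rotate(feats, pos):
        angles = pos[:, None].float() * freqs[None, :]   # (N, half/2)
        cos, sin = angles.cos(), angles.sin()
        f1, f2 = feats[..., 0::2], feats[..., 1::2]      # paired dims
        out = torch.empty_like(feats)
        out[..., 0::2] = f1 * cos - f2 * sin
        out[..., 1::2] = f1 * sin + f2 * cos
        return out

    x_row, x_col = x[..., :half], x[..., half:]
    return torch.cat([rotate(x_row, rows), rotate(x_col, cols)], dim=-1)

# Patch positions for a 3x4 grid (native aspect ratio preserved)
rows, cols = torch.meshgrid(torch.arange(3), torch.arange(4), indexing="ij")
x = torch.randn(12, 64)
x_rot = rope_2d(x, rows.flatten(), cols.flatten())
```

Because the rotation depends only on each patch's (row, column) position, the same encoder can attend over grids of arbitrary height and width, which is what makes variable resolutions and aspect ratios possible.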
The model is designed to handle any number of images within its long context window of 128K tokens, making it suitable for complex multimodal tasks. Despite its focus on multimodal capabilities, Pixtral 12B maintains strong performance on text-only tasks, matching or exceeding the performance of comparable models across various benchmarks.
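To see how image size interacts with that context budget, here is a back-of-the-envelope helper. It assumes the encoder tokenizes images into 16x16 patches at native resolution with one token per patch; the patch size and the absence of any extra separator tokens are assumptions made for illustration.

```python
# Rough estimate of how many context tokens an image consumes, assuming
# 16x16 patches at native resolution and one token per patch (both assumptions).
def image_token_count(width: int, height: int, patch: int = 16) -> int:
    cols = -(-width // patch)   # ceiling division
    rows = -(-height // patch)
    return rows * cols

print(image_token_count(512, 512))    # 1024 tokens
print(image_token_count(1024, 768))   # 3072 tokens
```

Under these assumptions a 512x512 image costs about 1K tokens while a 1024x1024 image costs about 4K, which is why the choice between low- and high-resolution processing matters in latency- or context-constrained settings.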
Results
Pixtral 12B demonstrates strong performance across a range of multimodal benchmarks, outperforming open models of comparable size and even some substantially larger models, while remaining competitive on text-only evaluations.
Conclusion
Pixtral 12B represents a significant advancement in multimodal language models, offering strong performance on both image-text and text-only tasks while maintaining a relatively small model size. For more information, please consult the full paper.
Congrats to the authors for their work!
"Pixtral 12B." arXiv:2410.07073v1 [cs.CV], 9 Oct. 2024, arxiv.org/abs/2410.07073.