Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Credit: https://arxiv.org/pdf/2503.01743

Today's paper introduces Phi-4-Mini and Phi-4-Multimodal, two compact yet powerful language models. Phi-4-Mini is a 3.8-billion-parameter language model that outperforms similar-sized models and matches larger models on reasoning tasks, while Phi-4-Multimodal extends these capabilities to handle text, vision, and speech/audio inputs through an innovative mixture-of-LoRAs approach.

Method Overview

The Phi-4 models build upon the previous Phi model family, which demonstrated that carefully curated and synthesized data enables Small Language Models (SLMs) to achieve competitive performance despite having fewer parameters. Phi-4-Mini consists of 32 Transformer layers with a hidden state size of 3,072 and employs Group Query Attention (GQA) to optimize memory usage for long-context generation.
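To make the memory argument concrete, here is a minimal PyTorch sketch of grouped-query attention, in which many query heads share a smaller set of key/value heads so the KV cache grows with the KV-head count rather than the query-head count. The 24-query-head / 8-KV-head split and the bias-free projections are illustrative assumptions, not necessarily the exact Phi-4-Mini configuration.

```python
import torch
import torch.nn.functional as F
from torch import nn

class GroupedQueryAttention(nn.Module):
    """Minimal GQA layer: n_q query heads share n_kv (< n_q) key/value heads,
    shrinking the KV cache that dominates memory in long-context generation."""
    def __init__(self, hidden=3072, n_q_heads=24, n_kv_heads=8):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.head_dim = hidden // n_q_heads                  # 128 with these numbers
        self.q_proj = nn.Linear(hidden, n_q_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(hidden, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(hidden, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_q_heads * self.head_dim, hidden, bias=False)

    def forward(self, x):                                    # x: (batch, seq, hidden)
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_q, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv, self.head_dim).transpose(1, 2)
        # Each group of query heads attends to the same key/value head; real
        # kernels index the shared heads instead of materializing copies.
        k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```

With these illustrative numbers, only 8 of the 24 attention heads need cached keys and values during decoding, cutting KV-cache memory to a third of the standard multi-head equivalent.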

For the multimodal extension, Phi-4-Multimodal introduces a novel "mixture of LoRAs" technique. This approach keeps the base language model entirely frozen while adding modality-specific Low-Rank Adaptation (LoRA) modules for vision and speech/audio processing. Each modality has its own encoder and projector: the vision modality uses a SigLIP-400M-based image encoder with a 2-layer MLP projector, while the speech/audio modality employs an audio encoder with conformer blocks and a similar projector structure.
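The sketch below illustrates the mixture-of-LoRAs idea at the level of a single projection: the base weight stays frozen, and a low-rank adapter selected by the input's modality is added on top. Module names, the rank, and the per-call routing argument are assumptions for illustration rather than the paper's implementation.

```python
import torch
from torch import nn

class LoRAAdapter(nn.Module):
    """Low-rank update: adds up(down(x)) * scale on top of a frozen projection."""
    def __init__(self, dim_in, dim_out, rank=64, alpha=128):
        super().__init__()
        self.down = nn.Linear(dim_in, rank, bias=False)      # A matrix
        self.up = nn.Linear(rank, dim_out, bias=False)       # B matrix, starts at zero
        nn.init.zeros_(self.up.weight)
        self.scale = alpha / rank

    def forward(self, x):
        return self.up(self.down(x)) * self.scale


class MixtureOfLoRALinear(nn.Module):
    """A frozen base projection plus one LoRA adapter per modality.
    Only the adapter matching the active modality is applied."""
    def __init__(self, dim_in, dim_out, modalities=("vision", "speech")):
        super().__init__()
        self.base = nn.Linear(dim_in, dim_out, bias=False)
        self.base.weight.requires_grad_(False)               # base LM stays frozen
        self.adapters = nn.ModuleDict(
            {m: LoRAAdapter(dim_in, dim_out) for m in modalities}
        )

    def forward(self, x, modality=None):
        y = self.base(x)
        if modality is not None:                             # e.g. "vision"
            y = y + self.adapters[modality](x)
        return y
```

Because the adapters start at zero and the base weights never change, pure text inputs behave exactly like the frozen Phi-4-Mini backbone, while each modality can be trained or served independently.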

The training process follows multiple stages. For the language model, Phi-4-Mini is trained on high-quality, reasoning-rich text data, including curated code datasets. For multimodal capabilities, the training includes vision training (projector alignment, joint vision training, generative vision-language training, and multi-frame training), speech/audio training (pre-training on ASR data and post-training on various speech tasks), and vision-speech joint training to enable cross-modal understanding.
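A schematic of how such a staged recipe can be driven in code is shown below. The stage names follow the report's description of the vision branch, but exactly which modules are unfrozen at each stage, and the set_trainable/run_training helpers, are assumptions for illustration.

```python
# Illustrative stage table for the vision branch; stage names follow the report,
# but the trainable-module assignment per stage is an assumption here.
VISION_STAGES = [
    {"name": "projector_alignment",    "train": ["projector"]},
    {"name": "joint_vision_training",  "train": ["vision_encoder", "projector"]},
    {"name": "generative_vl_training", "train": ["projector", "vision_lora"]},
    {"name": "multi_frame_training",   "train": ["projector", "vision_lora"]},
]

def set_trainable(model, trainable_names):
    """Freeze everything, then unfreeze only the listed submodules."""
    for p in model.parameters():
        p.requires_grad_(False)
    for name in trainable_names:
        for p in getattr(model, name).parameters():
            p.requires_grad_(True)

# for stage in VISION_STAGES:
#     set_trainable(model, stage["train"])
#     run_training(model, stage["name"])   # hypothetical training loop
```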

For speech and audio, the model can handle inputs up to 30 minutes long (theoretically up to 2.8 hours), making it suitable for tasks like speech summarization. The paper also explores a reasoning-enhanced version of Phi-4-Mini through additional training on reasoning data.
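As a rough sanity check on the 2.8-hour figure, the arithmetic below assumes the audio encoder emits tokens at an 80 ms rate (750 tokens per minute of audio) feeding a 128K-token context window; if the actual token rate or context length differs from these assumptions, the numbers shift accordingly.

```python
# Back-of-envelope check of the theoretical audio length limit, assuming an
# 80 ms audio-token rate (750 tokens per minute) and a 128K-token context.
TOKENS_PER_MINUTE = 60 / 0.080            # 750 audio tokens per minute
CONTEXT_TOKENS = 128_000

max_minutes = CONTEXT_TOKENS / TOKENS_PER_MINUTE
print(f"~{max_minutes:.0f} minutes (~{max_minutes / 60:.1f} hours) of audio")
# -> roughly 171 minutes, i.e. about 2.8 hours
```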

Results

Phi-4-Mini demonstrates remarkable performance across various benchmarks. On language tasks, it outperforms similar-sized models and matches models twice its size, particularly excelling in math and coding tasks. For instance, it achieves 88.6% accuracy on GSM-8K and 74.4% on HumanEval, surpassing many larger models.

Phi-4-Multimodal shows strong performance on vision-language benchmarks, outperforming baseline models of similar size and even surpassing some closed-source models on chart understanding and science reasoning tasks. On vision-speech benchmarks, it significantly outperforms larger models like InternOmni and Gemini-2.0-Flash.

For speech and audio capabilities, Phi-4-Multimodal achieves state-of-the-art performance on ASR tasks, ranking first on the Hugging Face OpenASR leaderboard with a 5.5% relative improvement over the previous best model. It also demonstrates strong performance on speech translation, summarization, and audio understanding tasks.

Safety evaluations show that both models perform well in refusing to answer harmful prompts and demonstrate robustness against jailbreak attempts, with performance comparable to or better than similar-sized models.

Conclusion

Phi-4-Mini and Phi-4-Multimodal represent significant advances in compact language and multimodal models. Through careful data curation, thoughtful architecture design, and the mixture-of-LoRAs approach, these models achieve performance that matches or exceeds that of much larger models on a range of tasks. For more information, please consult the full paper.

Congrats to the authors for their work!

Microsoft. "Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs." arXiv preprint arXiv:2503.01743, 2025.
