Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Credit: https://arxiv.org/pdf/2503.01743

Today's paper introduces Phi-4-Mini and Phi-4-Multimodal, two compact yet powerful language models. Phi-4-Mini is a 3.8-billion-parameter language model that outperforms similar-sized models and matches larger models on reasoning tasks, while Phi-4-Multimodal extends these capabilities to handle text, vision, and speech/audio inputs through an innovative mixture-of-LoRAs approach.

Method Overview

The Phi-4 models build upon the previous Phi model family, which demonstrated that carefully curated and synthesized data enables Small Language Models (SLMs) to achieve competitive performance despite having fewer parameters. Phi-4-Mini consists of 32 Transformer layers with a hidden state size of 3,072 and employs Group Query Attention (GQA) to optimize memory usage for long-context generation.
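To make the memory argument concrete, here is a minimal PyTorch sketch of grouped-query attention, in which many query heads share a smaller set of key/value heads so the KV cache grows with the KV-head count rather than the query-head count. The 24-query-head / 8-KV-head split and the bias-free projections are illustrative assumptions, not necessarily the exact Phi-4-Mini configuration.

```python
import torch
import torch.nn.functional as F
from torch import nn

class GroupedQueryAttention(nn.Module):
    """Minimal GQA layer: n_q query heads share n_kv (< n_q) key/value heads,
    shrinking the KV cache that dominates memory in long-context generation."""
    def __init__(self, hidden=3072, n_q_heads=24, n_kv_heads=8):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.head_dim = hidden // n_q_heads                  # 128 with these numbers
        self.q_proj = nn.Linear(hidden, n_q_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(hidden, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(hidden, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_q_heads * self.head_dim, hidden, bias=False)

    def forward(self, x):                                    # x: (batch, seq, hidden)
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_q, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv, self.head_dim).transpose(1, 2)
        # Each group of query heads attends to the same key/value head; real
        # kernels index the shared heads instead of materializing copies.
        k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```

With these illustrative numbers, only 8 of the 24 attention heads need cached keys and values during decoding, cutting KV-cache memory to a third of the standard multi-head equivalent.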

For the multimodal extension, Phi-4-Multimodal introduces a novel "mixture of LoRAs" technique. This approach keeps the base language model entirely frozen while adding modality-specific Low-Rank Adaptation (LoRA) modules for vision and speech/audio processing. Each modality has its own encoder and projector: the vision modality uses a SigLIP-400M-based image encoder with a 2-layer MLP projector, while the speech/audio modality employs an audio encoder with conformer blocks and a similar projector structure.
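The sketch below illustrates the mixture-of-LoRAs idea at the level of a single projection: the base weight stays frozen, and a low-rank adapter selected by the input's modality is added on top. Module names, the rank, and the per-call routing argument are assumptions for illustration rather than the paper's implementation.

```python
import torch
from torch import nn

class LoRAAdapter(nn.Module):
    """Low-rank update: adds up(down(x)) * scale on top of a frozen projection."""
    def __init__(self, dim_in, dim_out, rank=64, alpha=128):
        super().__init__()
        self.down = nn.Linear(dim_in, rank, bias=False)      # A matrix
        self.up = nn.Linear(rank, dim_out, bias=False)       # B matrix, starts at zero
        nn.init.zeros_(self.up.weight)
        self.scale = alpha / rank

    def forward(self, x):
        return self.up(self.down(x)) * self.scale


class MixtureOfLoRALinear(nn.Module):
    """A frozen base projection plus one LoRA adapter per modality.
    Only the adapter matching the active modality is applied."""
    def __init__(self, dim_in, dim_out, modalities=("vision", "speech")):
        super().__init__()
        self.base = nn.Linear(dim_in, dim_out, bias=False)
        self.base.weight.requires_grad_(False)               # base LM stays frozen
        self.adapters = nn.ModuleDict(
            {m: LoRAAdapter(dim_in, dim_out) for m in modalities}
        )

    def forward(self, x, modality=None):
        y = self.base(x)
        if modality is not None:                             # e.g. "vision"
            y = y + self.adapters[modality](x)
        return y
```

Because the adapters start at zero and the base weights never change, pure text inputs behave exactly like the frozen Phi-4-Mini backbone, while each modality can be trained or served independently.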

The training process follows multiple stages. For the language model, Phi-4-Mini is trained on high-quality, reasoning-rich text data, including curated code datasets. For multimodal capabilities, the training includes vision training (projector alignment, joint vision training, generative vision-language training, and multi-frame training), speech/audio training (pre-training on ASR data and post-training on various speech tasks), and vision-speech joint training to enable cross-modal understanding.
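A schematic of how such a staged recipe can be driven in code is shown below. The stage names follow the report's description of the vision branch, but exactly which modules are unfrozen at each stage, and the set_trainable/run_training helpers, are assumptions for illustration.

```python
# Illustrative stage table for the vision branch; stage names follow the report,
# but the trainable-module assignment per stage is an assumption here.
VISION_STAGES = [
    {"name": "projector_alignment",    "train": ["projector"]},
    {"name": "joint_vision_training",  "train": ["vision_encoder", "projector"]},
    {"name": "generative_vl_training", "train": ["projector", "vision_lora"]},
    {"name": "multi_frame_training",   "train": ["projector", "vision_lora"]},
]

def set_trainable(model, trainable_names):
    """Freeze everything, then unfreeze only the listed submodules."""
    for p in model.parameters():
        p.requires_grad_(False)
    for name in trainable_names:
        for p in getattr(model, name).parameters():
            p.requires_grad_(True)

# for stage in VISION_STAGES:
#     set_trainable(model, stage["train"])
#     run_training(model, stage["name"])   # hypothetical training loop
```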

For speech and audio, the model can handle inputs up to 30 minutes long (theoretically up to 2.8 hours), making it suitable for tasks like speech summarization. The paper also explores a reasoning-enhanced version of Phi-4-Mini through additional training on reasoning data.
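As a rough sanity check on the 2.8-hour figure, the arithmetic below assumes the audio encoder emits tokens at an 80 ms rate (750 tokens per minute of audio) feeding a 128K-token context window; if the actual token rate or context length differs from these assumptions, the numbers shift accordingly.

```python
# Back-of-envelope check of the theoretical audio length limit, assuming an
# 80 ms audio-token rate (750 tokens per minute) and a 128K-token context.
TOKENS_PER_MINUTE = 60 / 0.080            # 750 audio tokens per minute
CONTEXT_TOKENS = 128_000

max_minutes = CONTEXT_TOKENS / TOKENS_PER_MINUTE
print(f"~{max_minutes:.0f} minutes (~{max_minutes / 60:.1f} hours) of audio")
# -> roughly 171 minutes, i.e. about 2.8 hours
```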

Results

Phi-4-Mini demonstrates remarkable performance across various benchmarks. On language tasks, it outperforms similar-sized models and matches models twice its size, particularly excelling in math and coding tasks. For instance, it achieves 88.6% accuracy on GSM-8K and 74.4% on HumanEval, surpassing many larger models.

Phi-4-Multimodal shows strong performance on vision-language benchmarks, outperforming baseline models of similar size and even surpassing some closed-source models on chart understanding and science reasoning tasks. On vision-speech benchmarks, it significantly outperforms larger models like InternOmni and Gemini-2.0-Flash.

For speech and audio capabilities, Phi-4-Multimodal achieves state-of-the-art performance on ASR tasks, ranking first on the Hugging Face OpenASR leaderboard with a 5.5% relative improvement over the previous best model. It also demonstrates strong performance on speech translation, summarization, and audio understanding tasks.

Safety evaluations show that both models perform well in refusing to answer harmful prompts and demonstrate robustness against jailbreak attempts, with performance comparable to or better than similar-sized models.

Conclusion

Phi-4-Mini and Phi-4-Multimodal represent significant advances in compact language and multimodal models. Through careful data curation, thoughtful architecture design, and the mixture-of-LoRAs approach, these models achieve performance that matches or exceeds that of much larger models on a range of tasks. For more information, please consult the full paper.

Congrats to the authors for their work!

Microsoft. "Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs." arXiv preprint arXiv:2503.01743, 2025.
