Microsoft Phi: The surprising power of small language models
Nelson Fernandez-Pinto
GenAI at Air Liquide | I write about LLMs and Diffusion models
This article is a summary of Mojan Javaheripi's talk at NeurIPS 2023.
Since the introduction of transformers, we have seen exponential growth in the size of language models. Some speak of a Moore's Law of LMs, fueled by only a handful of AI labs with the right hardware. Certainly, a language model's capabilities are closely tied to its size. But what about small, efficient LMs? Can they exhibit emergent abilities at a smaller scale? This is the question tackled by the Foundation Models team at Microsoft Research.
The Phi SLM family's performance in a nutshell
With only 1.3 billion parameters, Phi-1.5 performs on natural language tasks comparably to models 5x its size. It also outperforms most mid-range LLMs on reasoning tasks such as elementary math and basic coding, all with a much smaller computational footprint: 4x faster inference, 5x less GPU memory, and 26x less training time compared to Llama-7B. The latest iteration, Phi-2 (2.7B parameters), is comparable to 7B and 13B models on commonsense reasoning and even surpasses Llama 2 70B, a model roughly 25x larger, on math and coding. Compared with other models of similar size, Phi-2 outperforms Google's Gemini Nano 2 across the reported benchmarks.
That said, the Phi models are still at the research stage, so we should expect additional fine-tuning and validation before building any application on top of them. Still, the results are impressive and promising!
The Secret Sauce for building efficient small language models is data, data, data.
As many of you have heard, the key ingredient for training efficient language models is high-quality training data. Just ask yourself: how would you learn good coding skills in less time? By reviewing lots of low-quality, random code on GitHub, or by focusing on a small subset of well-commented, textbook-quality code? This is exactly what the Microsoft Foundation Models team did to train Phi. Let's look at their training recipe in detail.
Filter high-educational-value data from a larger dataset
The goal is to create a smaller, textbook-quality dataset on a relevant topic. The Stack dataset contains almost all public source code on GitHub, a whopping ~1T tokens. To find the most valuable examples, we could use an LLM such as GPT-4 to scan the dataset with the prompt "Is this code snippet of textbook-quality educational value?". The problem is that, at this scale, that approach is incredibly expensive. So unless you have $1M to spend on OpenAI tokens, you can instead label a small fraction with the LLM and train your own lightweight text classifier (random forest, XGBoost, etc.) to filter the rest.
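As an illustration, here is a minimal sketch of that filtering step, assuming you already have GPT-4 quality labels for a small sample. The snippet data and the TF-IDF featurization are placeholders of my own; the actual pipeline reportedly used embeddings from a pretrained code model.

```python
# Minimal sketch: train a cheap quality classifier on a small GPT-4-labeled sample,
# then use it to filter the full corpus. Feature extraction here is TF-IDF for
# simplicity (illustrative assumption, not the team's actual features).
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical labeled sample: code snippets + GPT-4 verdicts
# (1 = "textbook-quality educational value", 0 = not).
labeled_snippets = [
    ("def is_singular(m):\n    return abs(np.linalg.det(m)) < 1e-9", 1),
    ("x=1;y=2;print(x+y)#tmp", 0),
]
texts, labels = zip(*labeled_snippets)

classifier = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # character n-grams work reasonably for code
    RandomForestClassifier(n_estimators=200, random_state=0),
)
classifier.fit(texts, labels)

# Score the full (unlabeled) corpus and keep only the high-value snippets.
corpus = ["import numpy as np\n# well-documented example ...", "print('hello')"]
keep = [s for s, p in zip(corpus, classifier.predict_proba(corpus)[:, 1]) if p > 0.5]
```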
Create synthetic textbooks using a larger LLM
To increase the amount of high-quality training data, Microsoft's team used GPT-3.5 to generate rich coding examples from natural language. The recipe is as follows: explain a theoretical concept (say, singular matrices), generate a concise prompt asking the LLM to code that concept (check whether a matrix is singular), and double-check the answer using prompt engineering. The challenge with this kind of data generation is ensuring diverse concepts, skills, and difficulty levels while avoiding repetition. The team found that injecting randomness into the prompt helps boost data diversity, as sketched below. To train Phi, they generated 1B high-quality tokens using this technique. The results are unequivocal: a small, curated dataset of filtered high-quality data and synthetic textbooks consistently improves performance across all model sizes. Interestingly, the gains are largest for the biggest model (Phi-1.5, 1.3B).
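Here is a rough sketch of that generation loop. The prompt wording, the seed lists, and the use of the OpenAI chat API are my own illustrative assumptions, not the team's actual setup.

```python
# Sketch of the synthetic-textbook recipe: pick a concept, inject random "seed"
# attributes into the prompt to boost diversity, and ask a larger LLM
# (GPT-3.5 in the talk) to write a textbook-style explanation plus code.
import random
from openai import OpenAI  # assumes the official openai package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CONCEPTS = ["singular matrices", "binary search", "memoization"]
AUDIENCES = ["a first-year CS student", "a data analyst", "a self-taught hobbyist"]
TWISTS = ["include a common pitfall", "add a small worked example", "end with one exercise"]

def build_prompt() -> str:
    # Randomly sampled attributes push the model toward more diverse outputs.
    return (
        f"Write a short textbook section about {random.choice(CONCEPTS)} "
        f"for {random.choice(AUDIENCES)}. Explain the idea, then show commented "
        f"Python code (e.g. check whether a matrix is singular), and {random.choice(TWISTS)}."
    )

def generate_textbook_chunk() -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": build_prompt()}],
        temperature=1.0,  # sampling temperature adds further diversity
    )
    return response.choices[0].message.content

# Double-check step (prompt engineering): ask the model to review its own code,
# or execute the generated snippet and keep only samples that run without errors.
```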
Align the model using synthetic instruction data
The last step is to create a small dataset (CodeExercises) to align the model to perform function-completion tasks. At this stage, the focus is on finetuning the model so that it answers in the way expected by the target evaluation benchmark, in this case HumanEval for code generation. The gains from this alignment are staggering: +22 points for only 200K training tokens, which is frankly excellent for a model of this size. A minimal finetuning sketch follows.
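This sketch uses Hugging Face transformers to continue training a small base model on exercise-style examples. The checkpoint name, the toy example, and the hyperparameters are placeholders; the real CodeExercises data is not public.

```python
# Minimal finetuning sketch for the alignment step: continue training a small
# causal LM on docstring-prompt + solution pairs (toy stand-in for CodeExercises).
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from datasets import Dataset

# Older transformers versions may require trust_remote_code=True for Phi checkpoints.
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5")

examples = [{"text": 'def add(a, b):\n    """Return the sum of a and b."""\n    return a + b'}]

def tokenize(example):
    out = tokenizer(example["text"], truncation=True, max_length=512)
    out["labels"] = out["input_ids"].copy()  # causal LM objective: predict the next token
    return out

dataset = Dataset.from_list(examples).map(tokenize, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="phi-finetuned", per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=1e-5),
    train_dataset=dataset,
)
trainer.train()
```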
Secret sauce: Scaling the weights from smaller pre-trained models
The Gopher paper explored techniques for scaling the weights of smaller transformer models up to bigger ones. The advantage is that learned information is reused for weight initialization, which results in faster training and better final performance. The Phi team focused on mapping the weights of a pre-trained phi-small (350M) onto the larger Phi-1 (1.3B). To do this, they had to scale the attention layer dimension from 1024 to 2048 and the number of attention heads from 16 to 32. A first approach, called WR (weight reuse), simply copies the pre-trained weights into the larger attention layer and randomly initializes the missing parameters. A second, more elegant approach, called tiling, fully initializes the new attention layer with several "tiles" of the smaller pre-trained layer. The results show interesting gains in both training iterations and final accuracy on the HumanEval code-generation benchmark.
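To make the two schemes concrete, here is a small PyTorch sketch of growing a single projection matrix from 1024 to 2048 dimensions. The tensor names and initialization scale are illustrative, not the team's actual code.

```python
# Two initialization schemes for growing a projection from d=1024 to d=2048.
# "Weight reuse" copies the pretrained block into one corner and leaves the rest random;
# "tiling" fills the entire larger matrix with copies of the pretrained block.
import torch

d_small, d_large = 1024, 2048
w_small = torch.randn(d_small, d_small)          # stands in for a pretrained projection
w_large = torch.randn(d_large, d_large) * 0.02   # fresh random init for the bigger model

# Weight reuse (WR): copy once, keep the remaining parameters random.
w_wr = w_large.clone()
w_wr[:d_small, :d_small] = w_small

# Tiling: repeat the pretrained block to cover the full larger matrix.
w_tiled = w_small.repeat(d_large // d_small, d_large // d_small)  # shape (2048, 2048)

# Either tensor can then be loaded into the larger model's state_dict before training.
```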
And that's it! In this interesting NeurIPS talk, we learned about the Phi family of small language models, their impressive performance, and a detailed recipe for training your own. One relevant takeaway is that the Phi family is not fully open source, so who's going to train the first fully open one?
I hope you liked this article; follow me for more content on frontier AI. Have a great week!