Microsoft Phi: The surprising power of small language models.
Source: Generated with DALL·E 3 using Bing's free image creator.


This article is a summary of Mojan Javaheripi's talk at NeurIPS 2023.


Since the creation of transformers, we have seen exponential growth in the size of language models. Some speak of a Moore's Law of LMs, fueled by a handful of AI labs with the right hardware. Certainly, the capabilities of an LM are tightly linked to its size. But what about small and efficient LMs? Can they exhibit emergent abilities at a smaller scale? This is the question tackled by the Foundation Models team at Microsoft Research.


The Phi SLM family's performance in a nutshell

With only 1.3 billion parameters, Phi-1.5 exhibits performance on natural language tasks comparable to models 5x larger. It also outperforms most mid-range LLMs on reasoning tasks such as elementary math and basic coding, all with a smaller computational footprint: 4x faster inference, 5x less GPU memory, and 26x less training time compared to Llama-7B. The latest iteration, Phi-2 (2.7B parameters), is comparable to 7B and 13B models in commonsense reasoning and even surpasses Llama 2 70B, a model more than 25x larger, in math and coding. Compared with other models of similar size, Phi-2 outperforms Google's Gemini Nano 2 on all reported benchmarks. That said, Phi models are still at the research stage, so expect additional fine-tuning and validation before building any application on top of them. Still, it is quite impressive and promising!
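If you want a quick feel for these models before any serious evaluation, the checkpoints are published on the Hugging Face Hub. Below is a minimal sketch of prompting Phi-2 with the transformers library; the model id, dtype, and generation settings are my own assumptions, not part of the talk.

```python
# Minimal sketch: prompt Phi-2 locally with Hugging Face transformers.
# Assumes the "microsoft/phi-2" Hub checkpoint and enough CPU/GPU memory;
# older transformers versions may additionally need trust_remote_code=True.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"  # assumed Hub id; swap for "microsoft/phi-1_5" if preferred
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float32  # use float16 on a GPU to save memory
)

prompt = "Write a Python function that checks whether a number is prime."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```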

Performance comparison of Phi small language models against larger models in several benchmarks. Source: Textbooks Are All You Need II: phi-1.5 technical report.



The secret sauce for building efficient small language models is data, data, data

As many of you have heard, the key element for training efficient language models is high-quality training data. Just ask yourself: how would you learn good coding skills in less time? By reviewing lots of low-quality, random code on GitHub, or by focusing on a small subset of textbook-quality, well-commented code? This is exactly what the Microsoft foundation models team did to train Phi. Let's look at their training recipe in detail.



Filter high educational value data from a bigger dataset

The goal is to create a smaller, textbook-quality dataset on a relevant topic. The Stack dataset contains almost all of the public source code on GitHub, a whopping 1T tokens. To find the most valuable examples, we could use an LLM such as GPT-4 to scan the dataset with the prompt "Is this code snippet of textbook-quality educational value?". The problem is that at this scale, this solution is incredibly expensive. So, unless you have $1M to spend on OpenAI tokens, you can instead label a small fraction and train your own text classifier (random forest, XGBoost, etc.).
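To make the cheaper route concrete, here is a minimal sketch: take a small LLM-labeled (or hand-labeled) sample and fit a lightweight classifier to filter the rest of the corpus. The TF-IDF features and random forest below are illustrative choices on my part, not necessarily what the Phi team used.

```python
# Minimal sketch: train a cheap "educational value" filter from a small labeled sample.
# The labels would come from prompting a strong LLM ("Is this snippet of
# textbook-quality educational value?") on a tiny fraction of the corpus.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy labeled sample: 1 = textbook quality, 0 = low value.
snippets = [
    'def is_singular(m):\n    """Return True if the matrix has no inverse."""\n    ...',
    "x=1;y=2;print(x+y) # tmp hack, do not use",
]
labels = [1, 0]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # character n-grams are robust for code
    RandomForestClassifier(n_estimators=200, random_state=0),
)
clf.fit(snippets, labels)

# Score unlabeled snippets and keep only the high-value ones.
new_snippets = ['def mean(xs):\n    """Average of a list."""\n    return sum(xs) / len(xs)']
keep = [s for s, p in zip(new_snippets, clf.predict(new_snippets)) if p == 1]
```

Once trained, the classifier can cheaply score the full corpus, and only the top-rated snippets make it into the pre-training mix.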


Examples of high (left) and low (right) educational value data. Source: own creation.



Create synthetic textbooks using a larger LLM

To increase the amount of high-quality training data, Microsoft's team used GPT-3.5 to generate rich coding examples from natural language. The recipe is the following: explain a theoretical concept, say singular matrices; generate a concise prompt asking the LLM to code that concept (check whether a matrix is singular); and double-check the answer using prompt engineering. The challenge with this type of data generation is ensuring diverse concepts, skills, and difficulty levels while avoiding repetition. The team found that adding randomness to the prompt helps boost data diversity. To train Phi, the team generated 1B high-quality tokens using this technique. The results are clear-cut: using a small, curated dataset of filtered high-quality data and synthetic textbooks consistently increases the performance of models across all sizes. Interestingly, the gains are largest for the biggest model (Phi-1.5, 1.3B).
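As an illustration of this generation loop, here is a hedged sketch using an OpenAI-compatible chat API. The topic list, the way randomness is injected, and the prompt wording are my own guesses at the idea, not the team's actual prompts.

```python
# Sketch of synthetic "textbook" generation: pick a concept, ask a larger LLM
# to explain it and write a well-commented solution, and add randomness to the
# prompt to boost diversity. Topics and prompts here are illustrative only.
import random
from openai import OpenAI  # assumes an OpenAI-compatible endpoint and API key

client = OpenAI()

topics = ["singular matrices", "binary search", "list comprehensions"]
audiences = ["a first-year student", "a self-taught developer", "a data analyst"]

def generate_textbook_sample(seed: int) -> str:
    rng = random.Random(seed)
    topic, audience = rng.choice(topics), rng.choice(audiences)
    prompt = (
        f"Write a short textbook section for {audience} explaining {topic}. "
        f"Then write a concise, well-commented Python function implementing the concept "
        f"(e.g. check whether a matrix is singular), followed by a brief self-check of the answer."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,  # extra randomness helps diversity
    )
    return response.choices[0].message.content

samples = [generate_textbook_sample(seed) for seed in range(3)]
```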

Comparison of performance of a small language model on OpenAI's HumanEval benchmark with and without textbook-quality data. Source: Textbooks Are All You Need, Gunasekar et al.



Align the model using synthetic instruction data

The last step is to create a small dataset (CodeExercises) to align the model to perform function-completion tasks. At this stage, the focus is on finetuning the model so it answers in the format expected by the target evaluation benchmark, in this case code generation on HumanEval. The gains from this alignment are staggering: +22 points for only 200K training tokens. This is frankly excellent for a model of its size.
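For illustration only, here is a hedged sketch of such an alignment pass with Hugging Face transformers. The phi-1.5 checkpoint id, the toy exercise, and the hyperparameters are assumptions of mine; the actual CodeExercises dataset is not public.

```python
# Sketch: finetune a small causal LM on "code exercise" examples so it answers
# in the function-completion format expected by HumanEval.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "microsoft/phi-1_5"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # the tokenizer may lack a pad token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Toy exercise in a docstring -> solution format (the real dataset is much larger).
examples = [{
    "text": 'def is_even(n: int) -> bool:\n    """Return True if n is even."""\n'
            "    return n % 2 == 0\n"
}]
dataset = Dataset.from_list(examples).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="phi-aligned", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```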



Secret sauce: Scaling the weights from smaller pre-trained models

The Gopher paper explored techniques for scaling weights from smaller transformer models to bigger ones. The advantage is reusing learned information for weight initialization, which results in faster training and better final performance. The Phi team focused on mapping the weights from a pre-trained small Phi (350M) to the larger Phi-1 (1.3B). To do this, they had to scale the attention layer dimension from 1024 to 2048 and the number of attention heads from 16 to 32. A first approach, called WR (weight reuse), simply maps the pre-trained weights onto the larger attention layer and randomly initializes the missing parameters. A second, more elegant approach, called tiling, fully initializes the new attention layer with several "tiles" of the smaller pre-trained layer. The results show interesting gains in both the number of training iterations and final accuracy on the HumanEval code generation benchmark.
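To illustrate the difference between the two schemes on a single weight matrix, here is a simplified NumPy sketch; it is my own toy version of the idea, not the team's implementation, and it ignores attention-specific details such as head splitting.

```python
# Sketch of the two initialization schemes for growing a weight matrix,
# e.g. from a 1024-dim to a 2048-dim attention projection.
import numpy as np

def weight_reuse(small: np.ndarray, new_shape: tuple[int, int]) -> np.ndarray:
    """WR: copy the pre-trained block, randomly initialize the remaining parameters."""
    big = np.random.normal(0.0, 0.02, size=new_shape)
    r, c = small.shape
    big[:r, :c] = small
    return big

def tiling(small: np.ndarray, new_shape: tuple[int, int]) -> np.ndarray:
    """Tiling: fill the entire larger matrix with copies ("tiles") of the small one."""
    r, c = small.shape
    reps = (-(-new_shape[0] // r), -(-new_shape[1] // c))  # ceiling division
    return np.tile(small, reps)[: new_shape[0], : new_shape[1]]

pretrained = np.random.normal(0.0, 0.02, size=(1024, 1024))  # stand-in for the small Phi weights
w_wr = weight_reuse(pretrained, (2048, 2048))
w_tiled = tiling(pretrained, (2048, 2048))
```

Intuitively, tiling gives every new parameter a sensible starting point instead of pure noise, which is consistent with the faster convergence reported in the talk.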

Comparison of performance on HumanEval of Phi models trained from scratch versus reusing the weights of a smaller pre-trained model. Number of training iterations in parentheses. Source:



And that's it! In this interesting talk at NeurIPS, we learned about the Phi family of small language models, saw their impressive performance, and got a detailed recipe for training your own. One relevant caveat is that the Phi family is not fully open source, so who's going to train the first open one?

I hope you liked this article. Follow me for more content on frontier AI. Have a great week!

