Microsoft Phi: The surprising power of small language models
Nelson Fernandez-Pinto
GenAI at Air Liquide | I write about LLMs and Diffusion models
This article is a summary of Mojan Javaheripi's talk at NeurIPS 2023.
Since the introduction of transformers, we have seen exponential growth in the size of language models. Some speak of a Moore's Law of LMs, fueled by only a handful of AI labs with the right hardware. Certainly, a language model's capabilities are closely tied to its size. But what about small, efficient LMs? Can they exhibit emergent abilities at a smaller scale? This is the question tackled by the Foundation Models team at Microsoft Research.
The Phi SLM family's performance in a nutshell
With only 1.3 billion parameters, Phi-1.5 performs on natural language tasks comparably to models 5x its size. It also outperforms most mid-range LLMs on reasoning tasks such as elementary math and basic coding, all with a much smaller computational footprint: 4x faster inference, 5x less GPU memory, and 26x less training time compared to Llama-7B. The latest iteration, Phi-2 (2.7B parameters), is comparable to 7B and 13B models on commonsense reasoning and even surpasses Llama 2 70B, a model roughly 25x larger, on math and coding. Compared with other models of similar size, Phi-2 outperforms Google's Gemini Nano 2 across the reported benchmarks.
That said, the Phi models are still at the research stage, so we should expect additional fine-tuning and validation before building any application on top of them. Still, the results are impressive and promising!
The Secret Sauce for building efficient small language models is data, data, data.
As many of you have heard, the key ingredient for training efficient language models is high-quality training data. Just ask yourself: how would you learn good coding skills in less time? By reviewing lots of low-quality, random code on GitHub, or by focusing on a small subset of well-commented, textbook-quality code? This is exactly what the Microsoft Foundation Models team did to train Phi. Let's look at their training recipe in detail.
Filter high-educational-value data from a larger dataset
The goal is to create a smaller, textbook-quality dataset on a relevant topic. The Stack dataset contains almost all public source code on GitHub, a whopping ~1T tokens. To find the most valuable examples, we could use an LLM such as GPT-4 to scan the dataset with the prompt "Is this code snippet of textbook-quality educational value?". The problem is that, at this scale, that approach is incredibly expensive. So unless you have $1M to spend on OpenAI tokens, you can instead label a small fraction with the LLM and train your own lightweight text classifier (random forest, XGBoost, etc.) to filter the rest.
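As an illustration, here is a minimal sketch of that filtering step, assuming you already have GPT-4 quality labels for a small sample. The snippet data and the TF-IDF featurization are placeholders of my own; the actual pipeline reportedly used embeddings from a pretrained code model.

```python
# Minimal sketch: train a cheap quality classifier on a small GPT-4-labeled sample,
# then use it to filter the full corpus. Feature extraction here is TF-IDF for
# simplicity (illustrative assumption, not the team's actual features).
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical labeled sample: code snippets + GPT-4 verdicts
# (1 = "textbook-quality educational value", 0 = not).
labeled_snippets = [
    ("def is_singular(m):\n    return abs(np.linalg.det(m)) < 1e-9", 1),
    ("x=1;y=2;print(x+y)#tmp", 0),
]
texts, labels = zip(*labeled_snippets)

classifier = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # character n-grams work reasonably for code
    RandomForestClassifier(n_estimators=200, random_state=0),
)
classifier.fit(texts, labels)

# Score the full (unlabeled) corpus and keep only the high-value snippets.
corpus = ["import numpy as np\n# well-documented example ...", "print('hello')"]
keep = [s for s, p in zip(corpus, classifier.predict_proba(corpus)[:, 1]) if p > 0.5]
```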
Create synthetic textbooks using a larger LLM
To increase the amount of high-quality training data, Microsoft's team used GPT-3.5 to generate rich coding examples from natural language. The recipe is as follows: explain a theoretical concept (say, singular matrices), generate a concise prompt asking the LLM to code that concept (check whether a matrix is singular), and double-check the answer using prompt engineering. The challenge with this kind of data generation is ensuring diverse concepts, skills, and difficulty levels while avoiding repetition. The team found that injecting randomness into the prompt helps boost data diversity, as sketched below. To train Phi, they generated 1B high-quality tokens using this technique. The results are unequivocal: a small, curated dataset of filtered high-quality data and synthetic textbooks consistently improves performance across all model sizes. Interestingly, the gains are largest for the biggest model (Phi-1.5, 1.3B).
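Here is a rough sketch of that generation loop. The prompt wording, the seed lists, and the use of the OpenAI chat API are my own illustrative assumptions, not the team's actual setup.

```python
# Sketch of the synthetic-textbook recipe: pick a concept, inject random "seed"
# attributes into the prompt to boost diversity, and ask a larger LLM
# (GPT-3.5 in the talk) to write a textbook-style explanation plus code.
import random
from openai import OpenAI  # assumes the official openai package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CONCEPTS = ["singular matrices", "binary search", "memoization"]
AUDIENCES = ["a first-year CS student", "a data analyst", "a self-taught hobbyist"]
TWISTS = ["include a common pitfall", "add a small worked example", "end with one exercise"]

def build_prompt() -> str:
    # Randomly sampled attributes push the model toward more diverse outputs.
    return (
        f"Write a short textbook section about {random.choice(CONCEPTS)} "
        f"for {random.choice(AUDIENCES)}. Explain the idea, then show commented "
        f"Python code (e.g. check whether a matrix is singular), and {random.choice(TWISTS)}."
    )

def generate_textbook_chunk() -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": build_prompt()}],
        temperature=1.0,  # sampling temperature adds further diversity
    )
    return response.choices[0].message.content

# Double-check step (prompt engineering): ask the model to review its own code,
# or execute the generated snippet and keep only samples that run without errors.
```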
Align the model using synthetic instruction data
The last step is to create a small dataset (CodeExercises) to align the model to perform function-completion tasks. At this stage, the focus is on finetuning the model so that it answers in the way expected by the target evaluation benchmark, in this case HumanEval for code generation. The gains from this alignment are staggering: +22 points for only 200K training tokens, which is frankly excellent for a model of this size. A minimal finetuning sketch follows.
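This sketch uses Hugging Face transformers to continue training a small base model on exercise-style examples. The checkpoint name, the toy example, and the hyperparameters are placeholders; the real CodeExercises data is not public.

```python
# Minimal finetuning sketch for the alignment step: continue training a small
# causal LM on docstring-prompt + solution pairs (toy stand-in for CodeExercises).
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from datasets import Dataset

# Older transformers versions may require trust_remote_code=True for Phi checkpoints.
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5")

examples = [{"text": 'def add(a, b):\n    """Return the sum of a and b."""\n    return a + b'}]

def tokenize(example):
    out = tokenizer(example["text"], truncation=True, max_length=512)
    out["labels"] = out["input_ids"].copy()  # causal LM objective: predict the next token
    return out

dataset = Dataset.from_list(examples).map(tokenize, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="phi-finetuned", per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=1e-5),
    train_dataset=dataset,
)
trainer.train()
```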
Secret sauce: Scaling the weights from smaller pre-trained models
The Gopher paper explored techniques for scaling the weights of smaller transformer models up to bigger ones. The advantage is that learned information is reused for weight initialization, which results in faster training and better final performance. The Phi team focused on mapping the weights of a pre-trained phi-small (350M) onto the larger Phi-1 (1.3B). To do this, they had to scale the attention layer dimension from 1024 to 2048 and the number of attention heads from 16 to 32. A first approach, called WR (weight reuse), simply copies the pre-trained weights into the larger attention layer and randomly initializes the missing parameters. A second, more elegant approach, called tiling, fully initializes the new attention layer with several "tiles" of the smaller pre-trained layer. The results show interesting gains in both training iterations and final accuracy on the HumanEval code-generation benchmark.
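To make the two schemes concrete, here is a small PyTorch sketch of growing a single projection matrix from 1024 to 2048 dimensions. The tensor names and initialization scale are illustrative, not the team's actual code.

```python
# Two initialization schemes for growing a projection from d=1024 to d=2048.
# "Weight reuse" copies the pretrained block into one corner and leaves the rest random;
# "tiling" fills the entire larger matrix with copies of the pretrained block.
import torch

d_small, d_large = 1024, 2048
w_small = torch.randn(d_small, d_small)          # stands in for a pretrained projection
w_large = torch.randn(d_large, d_large) * 0.02   # fresh random init for the bigger model

# Weight reuse (WR): copy once, keep the remaining parameters random.
w_wr = w_large.clone()
w_wr[:d_small, :d_small] = w_small

# Tiling: repeat the pretrained block to cover the full larger matrix.
w_tiled = w_small.repeat(d_large // d_small, d_large // d_small)  # shape (2048, 2048)

# Either tensor can then be loaded into the larger model's state_dict before training.
```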
And that's it! In this interesting NeurIPS talk, we learned about the Phi family of small language models, their impressive performance, and a detailed recipe for training your own. One relevant takeaway is that the Phi family is not fully open source, so who's going to train the first fully open one?
I hope you liked this article; follow me for more content on frontier AI. Have a great week!