Unlocking the Power of Small Language Models (SLMs): Evolution of Phi
Header image generated with Gemini: "The Power of SLMs" in the style of Salvador Dalí.


How the Phi Models Are Changing Natural Language Processing Through Data Curation, Training Methodology, and Architectural Optimizations

The Phi family of language models, developed by Microsoft Research's Machine Learning Foundations team, has showcased the remarkable potential of small language models (SLMs) to achieve state-of-the-art performance on various benchmarks. Through strategic choices in training data quality, scaling techniques, and architectural optimizations, the Phi models have demonstrated capabilities that rival and even surpass those of much larger models. This essay will delve into the technical details of Phi-1, Phi-2, and Phi-3, exploring their innovations and contributions to the field of natural language processing.

Phi-1, the first model in the series, is a 1.3 billion parameter language model that achieved state-of-the-art performance on Python coding tasks, specifically on the HumanEval and MBPP benchmarks. The architecture is a decoder-only transformer with 24 layers, a hidden dimension of 2048, an MLP dimension of 8192, and 32 attention heads of dimension 64 each. Phi-1 uses rotary position embeddings with a rotary dimension of 32 and is implemented with FlashAttention for efficient attention computation. The key insight behind Phi-1's success lies in its training data, which follows the approach outlined in the "Textbooks Are All You Need" paper. By focusing on "textbook-quality" data, Phi-1 efficiently learned coding concepts and skills from a carefully curated dataset: filtered code samples from sources like The Stack (approximately 6B tokens), synthetically generated coding textbooks (around 1B tokens), and exercises (about 180M tokens) produced with GPT-3.5. The model was trained with the AdamW optimizer, a linear warmup and decay learning rate schedule, and attention and residual dropout of 0.1, for a total of about 51B tokens over roughly 8 passes on the data.


Pass@1 accuracy (%) on HumanEval.
Example prompt and Phi-1 response.
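To make the setup concrete, here is a minimal sketch of prompting Phi-1 for code completion with the Hugging Face transformers library. The repository name "microsoft/phi-1" and the generation settings are assumptions for illustration, not details taken from the paper.

```python
# A minimal sketch of prompting Phi-1 for a Python coding task via transformers.
# The model ID "microsoft/phi-1" is assumed; older transformers versions may
# additionally require trust_remote_code=True.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)

# Phi-1 was trained for code completion, so we give it a signature and docstring
# and let it fill in the function body.
prompt = '''def is_palindrome(s: str) -> bool:
    """Return True if s reads the same forwards and backwards."""
'''
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```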

Building upon the success of Phi-1, the team developed Phi-1.5, another 1.3 billion parameter model that extended the focus to common sense reasoning and language understanding. Phi-1.5 demonstrated performance comparable to models five times its size, thanks to training on a diverse dataset that included synthetic data designed to teach common sense reasoning and general knowledge across domains such as science, daily activities, and theory of mind. The model architecture and training methodology remained similar to Phi-1's, but the training dataset was expanded to incorporate more diverse web sources and synthetic data generated using GPT-3.5.

Phi-2, with 2.7 billion parameters, further pushed the boundaries of what SLMs can achieve. The model architecture is similar to Phi-1 and Phi-1.5, with the main difference being the increased number of parameters. By scaling up from Phi-1.5 and employing techniques like scaled knowledge transfer, Phi-2 was able to match or outperform models up to 25 times larger on complex benchmarks. The training data for Phi-2 consisted of 1.4T tokens from multiple passes over a mixture of synthetic and web datasets for NLP and coding, with the web data carefully filtered for educational value and content quality. The synthetic datasets were generated using GPT-3.5 and designed to teach common sense reasoning, general knowledge, and coding skills. Phi-2's training took 14 days on 96 A100 GPUs, using the AdamW optimizer with a learning rate of 1e-4, a batch size of 1024, and a sequence length of 2048. Despite being a base model without alignment through reinforcement learning from human feedback (RLHF) or instruction fine-tuning, Phi-2 exhibited better behavior with respect to toxicity and bias than existing open-source models that underwent alignment, as measured by the ToxiGen benchmark.

Comparison between the Phi-2 (2.7B) and Phi-1.5 (1.3B) models. All tasks are evaluated 0-shot, except for BBH and MMLU, which use 3-shot CoT and 5-shot, respectively.
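As a rough illustration of the optimizer and schedule quoted above for Phi-2 (AdamW with a peak learning rate of 1e-4 and linear warmup followed by linear decay), here is a PyTorch sketch. The stand-in module, warmup length, and total step count are placeholders, not values from the Phi-2 report.

```python
# Sketch of the AdamW + linear warmup/decay setup described above.
import torch
from torch import nn
from transformers import get_linear_schedule_with_warmup

# Tiny stand-in module; the real model is a 2.7B-parameter decoder-only transformer.
model = nn.TransformerDecoderLayer(d_model=2560, nhead=32, batch_first=True)

total_steps = 100_000   # placeholder; the real count depends on the 1.4T-token budget
warmup_steps = 1_000    # placeholder warmup length

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)

# Inside the training loop, after loss.backward() on a batch of 1024 sequences
# of length 2048:
#   optimizer.step()
#   scheduler.step()
#   optimizer.zero_grad()
```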

The most recent addition to the Phi family is Phi-3, which includes Phi-3-mini (3.8B parameters), Phi-3-small (7B parameters), and Phi-3-medium (14B parameters). Phi-3-mini has garnered significant attention for achieving performance rivaling models like GPT-3.5 and Mixtral 8x7B while being small enough to run locally on a smartphone. This breakthrough was made possible by further refining the data curation and training methodology introduced with Phi-1 and Phi-2.

Phi-3-mini's architecture is a transformer decoder with 32 heads, 32 layers, and a hidden dimension of 3072. It has a default context length of 4K, which can be extended to 128K using the LongRoPE technique (Phi-3-mini-128K). The model is trained on 3.3T tokens using bfloat16 precision and shares architectural similarities with Llama-2, such as the block structure and a vocabulary size of 32,064, for compatibility with existing tools and packages.

Prompt examples on Phi-3.
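For context, prompts like these can be issued against Phi-3-mini with Hugging Face transformers roughly as follows. The repository name "microsoft/Phi-3-mini-4k-instruct" is an assumption, and older transformers versions may additionally require trust_remote_code=True.

```python
# A minimal sketch of chatting with Phi-3-mini via transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": "Explain rotary position embeddings in two sentences."},
]
# apply_chat_template wraps the conversation in the model's chat tokens.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```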

Phi-3-mini's training data is a scaled-up version of the Phi-2 dataset, with even more stringent filtering of web data and additional synthetic data generation. Training is divided into two phases: the first teaches the model general knowledge and language understanding using filtered web sources, while the second incorporates even more heavily filtered web data along with synthetic data to develop the model's reasoning abilities and niche skills. The synthetic data is generated using GPT-3.5 and designed to teach common sense reasoning, general knowledge, and coding skills, with a focus on maximizing the model's reasoning capabilities within its limited capacity. This "data optimal" approach allows Phi-3-mini to achieve impressive performance despite its small size.
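The reports do not publish the filtering pipeline itself, but the idea of keeping only "high educational value" web documents can be sketched as follows; the classifier, threshold, and all names here are hypothetical.

```python
# Purely illustrative sketch of educational-value filtering: score each web
# document with a quality classifier and keep only high-scoring ones. The
# classifier, threshold, and scoring function are hypothetical.
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

@dataclass
class Document:
    url: str
    text: str

def filter_for_educational_value(
    docs: Iterable[Document],
    score_fn: Callable[[str], float],   # e.g. a small trained quality classifier
    threshold: float = 0.9,             # hypothetical cut-off
) -> Iterator[Document]:
    """Yield only documents the classifier rates as highly educational."""
    for doc in docs:
        if score_fn(doc.text) >= threshold:
            yield doc

# Usage with a toy heuristic standing in for a real classifier:
toy_score = lambda text: 1.0 if "theorem" in text.lower() else 0.1
docs = [Document("a", "A theorem and its proof..."), Document("b", "Celebrity gossip...")]
print([d.url for d in filter_for_educational_value(docs, toy_score)])  # -> ['a']
```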

Toy illustration of the blocksparse attention in Phi-3-small with 2 local blocks and a vertical stride of 3. The table shows the keys/values that a query token in block 8 attends to. Blue = local blocks, orange = remote/vertical blocks, gray = skipped blocks.

The larger Phi-3-small and Phi-3-medium models incorporate additional architectural optimizations to improve performance and efficiency. Phi-3-small is a 7B parameter model that uses the tiktoken tokenizer with a vocabulary size of 100,352 and has a default context length of 8,192. It follows the standard decoder architecture with 32 heads, 32 layers, and a hidden size of 4,096. The model employs GEGLU activation instead of GELU and uses Maximal Update Parametrization (muP) to tune hyperparameters on a small proxy model before transferring them to the target 7B model. Phi-3-small also utilizes grouped-query attention, where 4 queries share 1 key, and a novel blocksparse attention module to optimize training and inference speed. The blocksparse attention enforces different sparsity patterns over the key-value cache for each attention head, ensuring that all tokens are attended to while significantly reducing the cache size. The model alternates between dense attention layers and blocksparse attention layers to balance cache savings against long-context retrieval performance; a rough sketch of the sparsity pattern appears below.
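A toy reconstruction of that sparsity pattern, at block granularity, might look like the following. The block size, per-head offsets, and the alternation with dense layers are simplified; this mirrors the toy figure above, not Phi-3-small's actual kernel.

```python
# Illustrative block-level mask: local blocks plus a vertical stride over earlier blocks.
import torch

def blocksparse_block_mask(num_blocks: int, num_local: int = 2, vert_stride: int = 3,
                           head_offset: int = 0) -> torch.Tensor:
    """Boolean [num_blocks, num_blocks] mask: True where a query block may attend
    to a key/value block. Causal at block granularity."""
    mask = torch.zeros(num_blocks, num_blocks, dtype=torch.bool)
    for q in range(num_blocks):
        for k in range(q + 1):                           # only blocks at or before q
            local = k > q - num_local                    # the last `num_local` blocks
            vertical = (k % vert_stride) == head_offset  # strided "remote" blocks
            mask[q, k] = local or vertical
    return mask

mask = blocksparse_block_mask(num_blocks=9)
print(mask[8].int().tolist())  # key/value blocks a query in block 8 attends to
```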

Phi-3-medium is a 14B parameter model that follows the same architecture as Phi-3-mini, with 40 heads, 40 layers, and an embedding dimension of 5,120. It is trained on the same data as Phi-3-small for 4.8T tokens using the AdamW optimizer with a learning rate of 1e-4, a batch size of 1024, and a sequence length of 2048.

Post-training plays a crucial role in the performance and safety of the Phi-3 models. Phi-3-mini undergoes supervised fine-tuning (SFT) across diverse domains to improve task-specific performance, followed by direct preference optimization (DPO) to align the model with human preferences and encourage safe, helpful, and truthful responses. The SFT data covers a wide range of topics, including math, coding, reasoning, conversation, model identity, and safety, while the DPO data focuses on chat format, reasoning, and responsible AI efforts. This post-training process enables Phi-3-mini to be deployed as a safe and effective language model for a variety of applications.

Comparison of harmful response percentages, measured by the Microsoft AI Red Team, for Phi-3-mini before and after safety alignment.
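For readers unfamiliar with DPO, the objective can be written in a few lines of PyTorch: it increases the policy's preference margin between a chosen and a rejected response relative to a frozen reference model. The beta value and tensor shapes below are illustrative; Microsoft has not released its post-training code.

```python
# Minimal sketch of the DPO loss on precomputed sequence log-probabilities.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each argument is the summed log-probability of a full response, shape [batch]."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # -log sigmoid(beta * (policy margin - reference margin)), averaged over the batch
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage with random log-probabilities:
logps = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*logps))
```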

The performance of Phi-3 models on academic benchmarks is impressive, with Phi-3-mini achieving 69% accuracy on the MMLU benchmark and an average score of 8.38 on MT-bench. The larger Phi-3-small and Phi-3-medium models show further improvements, with Phi-3-small achieving 75.7% on MMLU and 8.70 on MT-bench, and Phi-3-medium achieving 78.0% on MMLU and 8.91 on MT-bench, although with diminishing returns beyond 7B parameters. One of the most exciting aspects of Phi-3-mini is its ability to be quantized to 4 bits, reducing its memory footprint to approximately 1.8GB, which allows for local inference on modern smartphones and edge devices at a speed of 12 tokens per second.
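The report describes an optimized on-device 4-bit build; a common way to approximate that memory footprint on a workstation is 4-bit loading through bitsandbytes, sketched below. The repository name is assumed, and the printed figure will differ somewhat from the roughly 1.8GB phone deployment.

```python
# Sketch of loading Phi-3-mini in 4-bit precision with bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls at runtime
)
model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)
print(model.get_memory_footprint() / 1e9, "GB")  # roughly 2 GB for a 3.8B model in 4-bit
```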

Alongside the Phi-3 language models, the team also developed Phi-3-vision, a 4.2B parameter multimodal model that combines a CLIP ViT-L/14 image encoder with Phi-3-mini-128K. This model is trained on a diverse dataset of interleaved image and text data, OCR from PDFs, charts, tables, and more. The visual tokens extracted by the image encoder are combined with text tokens in an interleaved manner, and a dynamic cropping strategy is used to split high-resolution images into a 2D array of blocks to handle various aspect ratios. Phi-3-vision undergoes pre-training on 0.5T tokens of both visual and text elements, with the maximum image resolution capped at 1344 × 1344. The model then goes through supervised fine-tuning (SFT) on a mixture of text and multimodal data covering general image understanding, chart/table/diagram reasoning, PowerPoint understanding, and safety, followed by direct preference optimization (DPO) on text and multimodal data for alignment with human preferences. Phi-3-vision demonstrates strong performance on multimodal reasoning benchmarks, such as MMMU, ScienceQA, MathVista, and more.

The demo case shows Phi-3-Vision’s capability in natural image understanding and reasoning.
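To illustrate the dynamic-cropping idea described above, here is a simplified sketch that tiles a high-resolution image into a 2D grid of fixed-size crops for the image encoder. The tile size, padding, and the absence of a global thumbnail are simplifications, not Phi-3-vision's exact recipe.

```python
# Illustrative dynamic cropping: cap the resolution, pad to a multiple of the tile
# size, and split the image into a grid of encoder-sized tiles.
from PIL import Image

def crop_into_tiles(image: Image.Image, tile: int = 336, max_side: int = 1344) -> list:
    """Downscale so neither side exceeds max_side, pad to a multiple of the tile
    size, and return the resulting grid of tile-sized crops."""
    image.thumbnail((max_side, max_side))                 # cap resolution, keep aspect ratio
    cols = -(-image.width // tile)                        # ceiling division
    rows = -(-image.height // tile)
    canvas = Image.new("RGB", (cols * tile, rows * tile))
    canvas.paste(image, (0, 0))                           # pad bottom/right with black
    return [canvas.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
            for r in range(rows) for c in range(cols)]

tiles = crop_into_tiles(Image.new("RGB", (1600, 900)))
print(len(tiles))  # 4 x 3 = 12 tiles for a 16:9 image capped at 1344 px
```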

The Phi family of language models has made significant strides in pushing the boundaries of what small language models can achieve. Through innovations in data curation, training methodology, and architectural optimizations, these models have demonstrated remarkable performance on a wide range of benchmarks, often rivaling or surpassing much larger models. The availability of models like Phi-2 and Phi-3-mini has opened exciting opportunities for researchers to further explore and develop language models for various applications, while their compact size enables deployment on resource-constrained devices.

As the field of natural language processing continues to evolve, the insights gained from the development of the Phi models will undoubtedly shape future research and development efforts. By emphasizing the importance of data quality, exploring efficient scaling techniques, and prioritizing safety and alignment, the Phi family has set a new standard for what small language models can achieve, paving the way for more accessible, efficient, and capable language models. Challenges remain in addressing factual inaccuracies, mitigating biases, ensuring safety and robustness, and extending these models to multilingual and multimodal tasks. Nonetheless, the Phi models represent a significant milestone in the development of small language models and showcase the potential for achieving state-of-the-art performance through strategic choices in data curation, training methodology, and architectural design.

References

  1. Gunasekar, Suriya, et al. "Textbooks are all you need." arXiv preprint arXiv:2306.11644 (2023).
  2. Javaheripi, Mojan, et al. "Phi-2: The surprising power of small language models." Microsoft Research Blog (2023).
  3. Abdin, Marah, et al. "Phi-3 technical report: A highly capable language model locally on your phone." arXiv preprint arXiv:2404.14219 (2024).

Arun C.

Senior Data Scientist

3 months ago

Dr. Raghavan, Thank you for your insightful article on the Phi models. Your comprehensive examination of data curation, training methodologies, and architectural optimizations in developing small language models (SLMs) is truly informative. The Phi series demonstrates remarkable advancements in performance through the use of textbook-quality data and innovative scaling techniques, enhancing both their efficiency and practical applicability. I particularly appreciate the potential of these models for deployment in resource-limited environments and their capability to maintain high accuracy, even in constrained settings like mobile devices. Furthermore, the focus on safety and bias mitigation in these models is commendable, setting new standards for responsible AI development. The evolution of the Phi models highlights the significant impact of strategic data and architectural choices in advancing the field of natural language processing.

Malur Narayan

Technology Leader | Responsible AI | AI/ML Solutions & Governance | Networks | Digital | Sustainability | Board Member | Mental Health Advocate

3 months ago

Excellent summary, Vijay. I'm curious: based on the capabilities of each model, what are the most promising use cases you have seen implemented? From what I can tell, Phi-1 excels at basic Python tasks, making it ideal for coding challenges. Phi-1.5, with its common-sense reasoning, seems suitable for chatbots. Phi-2, with its near-human performance, seems perfect for complex language and reasoning tasks.
