Introduction to Distilled Models
In the last couple of years, we have seen a distilled BERT (DistilBERT) and distilled GPT models. We have also started to see distilled models for image networks (see DeiT in the references). What are these models? How is a distilled model different from the original model?
The idea of distilling a model, more specifically distilling a neural network, was proposed by Geoffrey Hinton in 2015. As one of the founders of the deep learning field, Hinton was concerned with the large size of neural networks. He asked: Is it possible to train a large neural network and then distill the knowledge to a smaller neural network? In other words, the large neural network acts as a teacher, and the smaller network is a student. We would like the smaller network to copy the essential weights learned by the large network. But what weights do we copy?
Hinton and his co-authors designed a training method that allows the smaller network to learn its weights by observing the teacher's behavior (in this case, the predictions of the teacher network). The objective of training a neural network is to minimize its error, typically measured by the distance between the model's prediction and the real label. In distillation (or distilled learning), the student network minimizes not only the error against the real label, but also the error against the teacher's prediction. This may seem counterintuitive, since we might assume the real label alone is sufficient to train a model. However, the teacher's predictions carry extra information about how the teacher generalizes (for example, which wrong classes it still considers plausible), and this richer training signal helps the student network learn faster.
Hinton regarded the error against the real label as the hard loss (measured against the hard, ground-truth label), and the error against the teacher's prediction as the soft loss (the teacher can be wrong). Our goal is to minimize the weighted sum of these two losses. The weight assigned to the soft loss is called the soft weight, represented by the symbol λ. The total loss to be minimized is the following:
Loss = (1 - λ) L_hard + λ L_soft
Hinton introduced another parameter called temperature, denoted T, which adjusts the softmax function at the output layer. The softmax function, also called the normalized exponential function, converts a node's value by taking the exponential of that value and then normalizing it by the sum of the exponentials of all node values in the same layer. For a node i with value y_i, the softmax of this value is defined as:
softmax(y_i) = exp(y_i) / Σ_j exp(y_j)
As we can see, this function transforms any node value into a number between 0 and 1. Since the new values of all the nodes add up to 1, they can be interpreted as probabilities. Hinton's modified softmax function with a temperature parameter T is written as:
softmax_T(y_i) = exp(y_i / T) / Σ_j exp(y_j / T)
When T is 1, this is equivalent to the standard softmax function. When T is greater than 1, each value is divided by a larger number before taking the exponential, so the resulting probabilities are flattened. Hinton calls this a "softer probability distribution over classes", where the classes are the output classes.
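To make the temperature concrete, here is a minimal softmax-with-temperature function in Python with NumPy (the function name and the example logits are my own illustration, not from Hinton's paper):

import numpy as np

def softmax_with_temperature(logits, T=1.0):
    # Divide the logits by T before exponentiating; T = 1 gives the
    # standard softmax, T > 1 gives a flatter ("softer") distribution.
    scaled = np.asarray(logits, dtype=np.float64) / T
    scaled = scaled - scaled.max()   # subtract the max for numerical stability
    exps = np.exp(scaled)
    return exps / exps.sum()

logits = [3.0, 1.0, 0.2]
print(softmax_with_temperature(logits, T=1))   # peaked: roughly [0.84, 0.11, 0.05]
print(softmax_with_temperature(logits, T=4))   # softer: roughly [0.48, 0.29, 0.24]

Raising T spreads probability mass from the most likely class toward the other classes, which is exactly the "softer distribution" the student learns from.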
The new softmax function (with temperature T) is applied to both the teacher and the student. Because the gradients produced by the soft targets scale as 1/T^2, the soft loss is multiplied by T^2 so that its contribution remains comparable to the hard loss as T changes. Our total loss is now:
Loss = (1 - λ) L_hard + λ T^2 L_soft
With this new loss function, we can train a distilled neural network.
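Putting the pieces together, the following is a minimal sketch of this training loss in PyTorch. It is my own illustration rather than code from either paper; the function name distillation_loss, the use of KL divergence for the soft term, and the example values of T and λ are assumptions.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, lam=0.5):
    # Hard loss: cross-entropy between the student's prediction and the real label.
    hard = F.cross_entropy(student_logits, labels)
    # Soft loss: divergence between the temperature-softened student and
    # teacher distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    )
    # Loss = (1 - λ) L_hard + λ T^2 L_soft
    return (1 - lam) * hard + lam * (T ** 2) * soft

# Toy usage: a batch of 4 examples with 10 output classes.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.tensor([1, 3, 0, 7])
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()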
DistilBERT
DistilBERT adopts the distillation method from Hinton's paper with a small modification: it adds one more term to the total loss, a cosine embedding loss, which measures the distance between the student's and the teacher's embedding vectors. (A small change in terminology: the DistilBERT paper calls the soft loss the distillation loss.)
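The added term can be sketched with PyTorch's built-in CosineEmbeddingLoss (a rough illustration of the idea only; the hidden size of 768 matches BERT, but the batch size and random tensors are made up):

import torch
import torch.nn as nn

cosine_loss = nn.CosineEmbeddingLoss()

# Stand-ins for the student's and teacher's hidden-state vectors
# for a batch of 8 positions, each of size 768.
student_hidden = torch.randn(8, 768)
teacher_hidden = torch.randn(8, 768)

# A target of 1 tells the loss to pull each pair of vectors together,
# i.e. to align the student's representations with the teacher's.
target = torch.ones(student_hidden.size(0))
l_cos = cosine_loss(student_hidden, teacher_hidden, target)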
The distilled model (DistilBERT) has 6 layers and 66 million parameters, while BERT-base has 12 layers and 110 million parameters. The distilled model was trained on the same data as BERT (English Wikipedia and the Toronto Book Corpus).
On the GLUE (General Language Understanding Evaluation) tasks, BERT-base achieved an average score of 79.5% while DistilBERT achieved 77%. On the SQuAD dataset, BERT-base achieved an 88.5% F1 score, and DistilBERT around 86%. BERT-base is therefore about 2.5 percentage points better than DistilBERT.
DistilGPT2
DistilGPT2 is modeled after the original GPT-2, which came in four different sizes with different numbers of layers.
The smallest GPT-2 has 12 layers and 117 million parameters (Hugging Face reports 124 million, possibly due to a different way of counting). The distilled model, DistilGPT2, has 6 layers and 82 million parameters. The (word) embedding size of the smallest GPT-2 is 768, and DistilGPT2 keeps the same embedding size of 768.
While DistilGPT2 is twice as fast as GPT-2, its perplexity on a large text corpus is 5 points higher than GPT-2's. In NLP, the lower the perplexity, the better the model. Thus the smallest GPT-2 still performs better than DistilGPT2.
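For a quick check of these numbers, the sketch below uses the Hugging Face transformers library to load DistilGPT2, count its parameters, and compute perplexity on a single sentence (assuming transformers and torch are installed; one arbitrary sentence only gives a rough perplexity, not the benchmark figure):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
model.eval()

# Parameter count: roughly 82 million for DistilGPT2.
print(sum(p.numel() for p in model.parameters()))

# Perplexity is the exponential of the average cross-entropy loss.
inputs = tokenizer("A distilled model trades some accuracy for speed.", return_tensors="pt")
with torch.no_grad():
    loss = model(**inputs, labels=inputs["input_ids"]).loss
print(torch.exp(loss).item())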
Summary
A distilled network mimics the original network, but it can never surpass the original model's performance. This is because the distilled model is always smaller (the larger neural network has greater representational power) and always follows the original model (it is trained on the soft loss). On the other hand, a distilled network is always faster because it is smaller.
A distilled model is a good option when we don't have enough computing resources, and it is a viable option when its performance is close to that of the original model. On the other hand, when the distilled model performs much worse (in terms of accuracy, precision, perplexity, etc.) than the original model, or when we can easily speed up computation with parallelization or more powerful machines, we should choose the original model.
Reference
Hinton, Geoffrey E., Oriol Vinyals, and Jeffrey Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015).
Sanh, Victor, Lysandre Debut, Julien Chaumond, and Thomas Wolf. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." arXiv preprint arXiv:1910.01108 (2019).
Hugging Face documentation: DistilGPT2, 2019. https://huggingface.co/distilgpt2
GPT-2
Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. "Language models are unsupervised multitask learners." OpenAI blog 1, no. 8 (2019).
DeiT (Data-efficient Image Transformers)
Touvron, Hugo, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. "Training data-efficient image transformers & distillation through attention." arXiv preprint arXiv:2012.12877 (2020).
GLUE
Wang, Alex, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. "GLUE: A multi-task benchmark and analysis platform for natural language understanding." arXiv preprint arXiv:1804.07461 (2018). https://gluebenchmark.com/