Introduction to Distilled Models
In the last couple of years, we have seen a distilled BERT (DistilBERT) and distilled GPT models. We have also started to see distilled models for image networks (see DeiT in the references). What are these models? How is a distilled model different from the original model?
The idea of distilling a model, more specifically distilling a neural network, was proposed by Geoffrey Hinton in 2015. As one of the founders of the deep learning field, Hinton was concerned with the large size of neural networks. He asked: Is it possible to train a large neural network and then distill the knowledge to a smaller neural network? In other words, the large neural network acts as a teacher, and the smaller network is a student. We would like the smaller network to copy the essential weights learned by the large network. But what weights do we copy?
Hinton and his co-authors designed a training method that allows the smaller network to learn its weights by observing the teacher's behavior (in this case, the predictions of the teacher network). The objective of training a neural network is to minimize its error, typically measured by the distance between the model's prediction and the real label. In distillation (or distilled learning), the student network minimizes not only the error against the real label, but also the error against the teacher's prediction. This may seem counterintuitive, since we might assume the real label alone is sufficient to train a model. However, the teacher's predictions carry extra information about how the teacher generalizes (for example, which wrong classes it still considers plausible), and this richer training signal helps the student network learn faster.
Hinton regarded the error against the real label as the hard loss (measured against the hard, ground-truth label), and the error against the teacher's prediction as the soft loss (the teacher can be wrong). Our goal is to minimize the weighted sum of these two losses. The weight assigned to the soft loss is called the soft weight, represented by the symbol λ. The total loss to be minimized is the following:
Loss = (1 - λ) L_hard + λ L_soft
Hinton introduced another parameter called temperature, denoted T, which adjusts the softmax function at the output layer. The softmax function, also called the normalized exponential function, converts a node's value by taking the exponential of that value and then normalizing it by the sum of the exponentials of all node values in the same layer. For a node i with value y_i, the softmax of this value is defined as:
softmax(y_i) = exp(y_i) / Σ_j exp(y_j)
As we can see, this function transforms any node value into a number between 0 and 1. Since the new values of all the nodes add up to 1, they can be interpreted as probabilities. Hinton's modified softmax function with a temperature parameter T is written as:
softmax_T(y_i) = exp(y_i / T) / Σ_j exp(y_j / T)
When T is 1, this is equivalent to the standard softmax function. When T is greater than 1, each value is divided by a larger number before taking the exponential, so the resulting probabilities are flattened. Hinton calls this a "softer probability distribution over classes", where the classes are the output classes.
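To make the temperature concrete, here is a minimal softmax-with-temperature function in Python with NumPy (the function name and the example logits are my own illustration, not from Hinton's paper):

import numpy as np

def softmax_with_temperature(logits, T=1.0):
    # Divide the logits by T before exponentiating; T = 1 gives the
    # standard softmax, T > 1 gives a flatter ("softer") distribution.
    scaled = np.asarray(logits, dtype=np.float64) / T
    scaled = scaled - scaled.max()   # subtract the max for numerical stability
    exps = np.exp(scaled)
    return exps / exps.sum()

logits = [3.0, 1.0, 0.2]
print(softmax_with_temperature(logits, T=1))   # peaked: roughly [0.84, 0.11, 0.05]
print(softmax_with_temperature(logits, T=4))   # softer: roughly [0.48, 0.29, 0.24]

Raising T spreads probability mass from the most likely class toward the other classes, which is exactly the "softer distribution" the student learns from.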
The new softmax function (with temperature T) is applied to both the teacher and the student. Because the gradients produced by the soft targets scale as 1/T^2, the soft loss is multiplied by T^2 so that its contribution remains comparable to the hard loss as T changes. Our total loss is now:
Loss = (1 - λ) L_hard + λ T^2 L_soft
With this new loss function, we can train a distilled neural network.
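Putting the pieces together, the following is a minimal sketch of this training loss in PyTorch. It is my own illustration rather than code from either paper; the function name distillation_loss, the use of KL divergence for the soft term, and the example values of T and λ are assumptions.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, lam=0.5):
    # Hard loss: cross-entropy between the student's prediction and the real label.
    hard = F.cross_entropy(student_logits, labels)
    # Soft loss: divergence between the temperature-softened student and
    # teacher distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    )
    # Loss = (1 - λ) L_hard + λ T^2 L_soft
    return (1 - lam) * hard + lam * (T ** 2) * soft

# Toy usage: a batch of 4 examples with 10 output classes.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.tensor([1, 3, 0, 7])
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()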
DistilBERT
DistilBERT adopts the distillation method from Hinton's paper with a small modification: it adds one more term to the total loss, a cosine embedding loss, which measures the distance between the student's and the teacher's embedding vectors. (A small change in terminology: the DistilBERT paper calls the soft loss the distillation loss.)
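The added term can be sketched with PyTorch's built-in CosineEmbeddingLoss (a rough illustration of the idea only; the hidden size of 768 matches BERT, but the batch size and random tensors are made up):

import torch
import torch.nn as nn

cosine_loss = nn.CosineEmbeddingLoss()

# Stand-ins for the student's and teacher's hidden-state vectors
# for a batch of 8 positions, each of size 768.
student_hidden = torch.randn(8, 768)
teacher_hidden = torch.randn(8, 768)

# A target of 1 tells the loss to pull each pair of vectors together,
# i.e. to align the student's representations with the teacher's.
target = torch.ones(student_hidden.size(0))
l_cos = cosine_loss(student_hidden, teacher_hidden, target)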
The distilled model (DistilBERT) has 6 layers and 66 million parameters, while BERT-base has 12 layers and 110 million parameters. The distilled model was trained on the same data as BERT (English Wikipedia and the Toronto Book Corpus).
On the GLUE (General Language Understanding Evaluation) tasks, BERT-base achieved an average score of 79.5% while DistilBERT achieved 77%. On the SQuAD dataset, BERT-base achieved an 88.5% F1 score, and DistilBERT around 86%. BERT-base is therefore about 2.5 percentage points better than DistilBERT.
DistilGPT2
DistilGPT2 is modeled after the original GPT-2, which came in four different sizes with different numbers of layers.
The smallest GPT-2 has 12 layers and 117 million parameters (Hugging Face reports 124 million, possibly due to a different way of counting). The distilled model, DistilGPT2, has 6 layers and 82 million parameters. The (word) embedding size of the smallest GPT-2 is 768, and DistilGPT2 keeps the same embedding size of 768.
While DistilGPT2 is twice as fast as GPT-2, its perplexity on a large text corpus is 5 points higher than GPT-2's. In NLP, the lower the perplexity, the better the model. Thus the smallest GPT-2 still performs better than DistilGPT2.
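For a quick check of these numbers, the sketch below uses the Hugging Face transformers library to load DistilGPT2, count its parameters, and compute perplexity on a single sentence (assuming transformers and torch are installed; one arbitrary sentence only gives a rough perplexity, not the benchmark figure):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
model.eval()

# Parameter count: roughly 82 million for DistilGPT2.
print(sum(p.numel() for p in model.parameters()))

# Perplexity is the exponential of the average cross-entropy loss.
inputs = tokenizer("A distilled model trades some accuracy for speed.", return_tensors="pt")
with torch.no_grad():
    loss = model(**inputs, labels=inputs["input_ids"]).loss
print(torch.exp(loss).item())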
Summary
A distilled network mimics the original network, but it can never surpass the original model's performance. This is because the distilled model is always smaller (the larger neural network has greater representational power) and always follows the original model (it is trained on the soft loss). On the other hand, a distilled network is always faster because it is smaller.
A distilled model is a good option when we don't have enough computing resources, and it is a viable option when its performance is close to that of the original model. On the other hand, when the distilled model performs much worse (in terms of accuracy, precision, perplexity, etc.) than the original model, or when we can easily speed up computation with parallelization or more powerful machines, we should choose the original model.
Reference
Hinton, Geoffrey E., Oriol Vinyals, and Jeffrey Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015).
Sanh, Victor, Lysandre Debut, Julien Chaumond, and Thomas Wolf. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." arXiv preprint arXiv:1910.01108 (2019).
Hugging Face documentation: DistilGPT2, 2019. https://huggingface.co/distilgpt2
GPT-2
Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. "Language models are unsupervised multitask learners." OpenAI blog 1, no. 8 (2019).
DeiT (Data-efficient Image Transformers)
Touvron, Hugo, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. "Training data-efficient image transformers & distillation through attention." arXiv preprint arXiv:2012.12877 (2020).
GLUE
Wang, Alex, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. "GLUE: A multi-task benchmark and analysis platform for natural language understanding." arXiv preprint arXiv:1804.07461 (2018). https://gluebenchmark.com/