A quick review: GPT-2
Building upon the foundation laid by GPT-1, which hinted at the potential for multi-task learning in a zero-shot setting, GPT-2 scales up both the model and the training data. This expansion enables it to perform a wide array of language tasks without any supervised training, much like how Vivaldi's "Winter" effortlessly transitions through its movements to convey the essence of the season.
Paper title: Language Models are Unsupervised Multitask Learners.
Release Date: February 14, 2019 (the full 1.5B-parameter model was publicly released on November 5, 2019)
Recommendation: Follow AI Coffee Break with Letitia on YouTube, a channel that discusses the latest technical developments in large language models (LLMs), machine learning, and more.
Hint: From the title of the paper, we can derive two main ideas: Firstly, ongoing efforts are focused on unsupervised learning from data. Secondly, the authors regard language models as multi-task models capable of learning many tasks simultaneously, particularly during the pre-training phase.
Summary
The paper begins by criticizing supervised machine learning systems: although they excel when given enough data and a model of adequate capacity, they are fragile and sensitive to small changes in data distribution and task specification. In other words, they are restricted systems that can only perform specific things within a specific framework. They can be described as experts with narrow knowledge, meaning they have difficulty generalizing their knowledge and skills to new situations or tasks that differ even slightly from what they were trained to do.
The goal, in the authors' view, should therefore be more general systems that can perform many tasks without the need for supervised training for each one.
The paper points to multi-task learning (Caruana, 1997) as a promising framework for improving overall performance, but notes that it is still nascent and that recent improvements are modest (Yogatama et al., 2019). Even ambitious recent efforts, such as McCann et al. (2018) with 10 (dataset, objective) pairs and Bowman et al. (2018) with 17 tasks, will be very difficult to scale: creating datasets and designing objectives at the rate required to reach truly general systems is impractical. From a meta-learning perspective, each (dataset, objective) pair is a single training example drawn from the distribution of datasets and objectives, and current machine learning systems need hundreds or thousands of examples to learn functions that generalize well. This suggests that multi-task training would need a similarly large number of effective training pairs to realize its promise.
The authors were therefore motivated to realize the idea of multi-task learning in a different way by pre-training only on unsupervised data. According to the paper:
We demonstrate language models can perform downstream tasks in a zero-shot setting—without any parameter or architecture modification.
Approach
As discussed in the previous GPT-1 article, the approach is language modeling: predicting the next word given the previous words.
The idea is that a language model with sufficient capacity will begin to learn, infer, and perform language tasks simply in order to predict text better. A language model that does this is, in effect, performing unsupervised multi-task learning.
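For concreteness, here is a minimal sketch of that objective, assuming a PyTorch-style model that maps token IDs to next-token logits (the function and argument names are illustrative, not from the paper):

```python
import torch.nn.functional as F

def language_modeling_loss(model, token_ids):
    # token_ids: (batch, seq_len) tensor of BPE token IDs.
    # The model predicts token t+1 from tokens 1..t, so inputs and
    # targets are the same sequence shifted by one position.
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)                    # (batch, seq_len-1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and time
        targets.reshape(-1),                  # next-token targets
    )
```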
Logic. The basic idea is that language itself can provide the information needed to specify tasks.
- For example, a training example of translation could be a series of tokens such as: (translation into French, English text, French text).
- Likewise, an example of a reading comprehension task could be a sequence such as: (answer the question, document, question, answer).
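In practice, the task specification lives entirely in the prompt text. As a rough illustration (not from the paper), here is how one might probe the publicly released GPT-2 weights for zero-shot translation through the Hugging Face transformers library; the prompt wording is an assumption, and the small "gpt2" checkpoint will usually produce a poor translation:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# The task is specified in-line: a few "english = french" example pairs,
# followed by the sentence we actually want translated.
prompt = (
    "good morning = bonjour\n"
    "thank you very much = merci beaucoup\n"
    "how are you ="
)
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=10, do_sample=False,
                        pad_token_id=tokenizer.eos_token_id)
# Print only the continuation the model generated after the prompt.
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```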
Model. There are no significant architectural changes from GPT-1, only minor modifications: layer normalization is moved to the input of each sub-block, and an additional normalization layer is added after the final self-attention block. The vocabulary is expanded to 50,257 tokens to accommodate a wider range of words, the context size is increased from 512 to 1024 tokens, and a larger batch size of 512 is used.
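A rough sketch of the pre-norm arrangement described above, using standard PyTorch modules; this illustrates the idea (layer normalization applied before each sub-block) rather than OpenAI's actual implementation, and the default sizes are assumed to match the smallest model:

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Transformer block with layer normalization applied before each
    sub-block (as in GPT-2) rather than after (as in GPT-1)."""

    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln_1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln_2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, attn_mask=None):
        # Normalize first, then attend, then add the residual.
        h = self.ln_1(x)
        x = x + self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
        # Same pre-norm pattern for the feed-forward sub-block.
        x = x + self.mlp(self.ln_2(x))
        return x
```

A final normalization layer would then be applied after the last block, as the paper describes.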
Experiments
Four models of increasing size were tested. The smallest is equivalent to GPT-1, and the second smallest is equivalent to the largest BERT model. The learning rate of each model was manually tuned for the best perplexity on a 5% held-out sample of WebText. The paper also notes that none of the models fully fits the WebText training set, underscoring the case for even larger models in future work.
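For reference, the four configurations reported in the paper can be written down as a small lookup (the parameter counts, depths, and widths are taken from the paper; the dictionary itself is just an illustrative way to record them):

```python
# Approximate parameter count -> depth and width for the four GPT-2 sizes.
GPT2_CONFIGS = {
    "117M":  {"n_layer": 12, "d_model": 768},   # roughly GPT-1 sized
    "345M":  {"n_layer": 24, "d_model": 1024},  # roughly BERT-Large sized
    "762M":  {"n_layer": 36, "d_model": 1280},
    "1542M": {"n_layer": 48, "d_model": 1600},  # the model referred to as GPT-2
}
```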
Language Modeling:
GPT-2 demonstrated strong zero-shot domain transfer capabilities, improving state-of-the-art performance on 7 out of 8 datasets in zero-shot settings. Significant improvements were seen on smaller datasets like Penn Treebank and WikiText-2, and on datasets requiring long-term dependencies like LAMBADA and the Children’s Book Test (CBT). Performance on the One Billion Word Benchmark was lower, likely due to dataset size and destructive pre-processing.
Children’s Book Test (CBT):
GPT-2 achieved new state-of-the-art results with 93.3% accuracy on common nouns and 89.1% on named entities. Performance improved with model size and a de-tokenizer to remove tokenization artifacts.
LAMBADA:
GPT-2 reduced perplexity from 99.8 to 8.6 and increased accuracy from 19% to 52.66%. Applying a stop-word filter further increased accuracy to 63.24%, surpassing previous state-of-the-art results.
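The stop-word filter exploits the structure of LAMBADA: the target is always the final word of a sentence and is always a content word, whereas GPT-2's raw errors were mostly valid continuations that were not valid final words. A minimal sketch of the idea, assuming we already have the model's next-token logits and a hypothetical stop_word_token_ids set (a reconstruction of the trick, not the authors' code):

```python
import torch

def filtered_final_word_prediction(logits, stop_word_token_ids):
    # logits: (vocab_size,) next-token scores from the language model.
    # Rule out tokens corresponding to stop words, since the final word
    # of a LAMBADA sentence is always a content word.
    masked = logits.clone()
    masked[list(stop_word_token_ids)] = float("-inf")
    return int(torch.argmax(masked))
```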
Winograd Schema Challenge:
GPT-2 achieved 70.70% accuracy, improving the state of the art by 7%.
The dataset is small (273 examples), so this result should be interpreted with appropriate caution.
Translation:
GPT-2 was tested on the WMT-14 English-French test set, achieving a BLEU score of 5. This score is slightly worse than a word-by-word substitution with a bilingual lexicon from prior work.
Question Answering:
Evaluated on the Natural Questions dataset, GPT-2 correctly answered 4.1% of factoid-style questions.
Summarization:
Summaries were induced by appending the text "TL;DR:" after the article. The performance was lackluster: on ROUGE metrics the generated summaries only begin to approach classic neural baselines and barely outperform selecting three random sentences from the article.
Overall, GPT-2 showed significant advances across multiple tasks, particularly in zero-shot learning and in handling long-term dependencies, though there remains room for improvement on large-scale datasets and on specific tasks such as translation and open-domain question answering.
Conclusion
Anyone who read my previous article or the GPT-1 paper, particularly the “Experiments” section, could have anticipated that GPT-2 would involve a major expansion in model scale. The earlier paper hinted at the potential for multi-task learning in a zero-shot setting, and that expansion in scale is what makes it real here. The beauty of this is that a single model can perform many language tasks without the pain of supervised training for each one. This development marks a significant step toward more versatile and autonomous AI systems.
Next up: GPT-3.