A quick review: GPT-2
Building upon the foundation laid by GPT-1, which hinted at the potential for multi-task learning in a zero-shot setting, GPT-2 scales up both the model and the training data. This expansion enables it to perform a wide array of language tasks without any supervised training, much like how Vivaldi's "Winter" effortlessly transitions through its movements to convey the essence of the season.
Paper title: Language Models are Unsupervised Multitask Learners.
Release Date: February 14, 2019 (the full 1.5B-parameter model was publicly released on November 5, 2019)
Recommendation: Follow AI Coffee Break with Letitia on YouTube, a channel that discusses the latest technical developments in large language models (LLMs), machine learning, and more.
Hint: From the title of the paper, we can derive two main ideas: Firstly, ongoing efforts are focused on unsupervised learning from data. Secondly, the authors regard language models as multi-task models capable of learning many tasks simultaneously, particularly during the pre-training phase.
Summary
The paper begins by criticizing supervised machine learning systems: although they excel when given enough data and a model of adequate capacity, they are fragile and sensitive to small changes in data distribution and task specification. In other words, they are restricted systems that can only perform specific things within a specific framework. They can be described as experts with narrow knowledge, meaning they have difficulty generalizing their knowledge and skills to new situations or tasks that differ even slightly from what they were trained to do.
The goal, in the authors' view, should therefore be more general systems that can perform many tasks without the need for supervised training for each one.
The paper points to multi-task learning (Caruana, 1997) as a promising framework for improving overall performance, but notes that it is still nascent and that recent improvements are modest (Yogatama et al., 2019). Even ambitious recent efforts, such as McCann et al. (2018) with 10 (dataset, objective) pairs and Bowman et al. (2018) with 17 tasks, will be very difficult to scale: creating datasets and designing objectives at the rate required to reach truly general systems is impractical. From a meta-learning perspective, each (dataset, objective) pair is a single training example drawn from the distribution of datasets and objectives, and current machine learning systems need hundreds or thousands of examples to learn functions that generalize well. This suggests that multi-task training would need a similarly large number of effective training pairs to realize its promise.
The authors were therefore motivated to realize the idea of multi-task learning in a different way by pre-training only on unsupervised data. According to the paper:
We demonstrate language models can perform downstream tasks in a zero-shot setting—without any parameter or architecture modification.
Approach
As discussed in the previous GPT-1 article, the approach is language modeling: predicting the next word given the previous words.
The idea is that a language model with sufficient capacity will begin to learn, infer, and perform language tasks simply in order to predict text better. A language model that does this is, in effect, performing unsupervised multi-task learning.
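For concreteness, here is a minimal sketch of that objective, assuming a PyTorch-style model that maps token IDs to next-token logits (the function and argument names are illustrative, not from the paper):

```python
import torch.nn.functional as F

def language_modeling_loss(model, token_ids):
    # token_ids: (batch, seq_len) tensor of BPE token IDs.
    # The model predicts token t+1 from tokens 1..t, so inputs and
    # targets are the same sequence shifted by one position.
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)                    # (batch, seq_len-1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and time
        targets.reshape(-1),                  # next-token targets
    )
```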
Logic. The basic idea is that language itself can provide the information needed to specify tasks.
- For example, a training example of translation could be a series of tokens such as: (translation into French, English text, French text).
- Likewise, an example of a reading comprehension task could be a sequence such as: (answer the question, document, question, answer).
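In practice, the task specification lives entirely in the prompt text. As a rough illustration (not from the paper), here is how one might probe the publicly released GPT-2 weights for zero-shot translation through the Hugging Face transformers library; the prompt wording is an assumption, and the small "gpt2" checkpoint will usually produce a poor translation:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# The task is specified in-line: a few "english = french" example pairs,
# followed by the sentence we actually want translated.
prompt = (
    "good morning = bonjour\n"
    "thank you very much = merci beaucoup\n"
    "how are you ="
)
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=10, do_sample=False,
                        pad_token_id=tokenizer.eos_token_id)
# Print only the continuation the model generated after the prompt.
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```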
Model. There are no significant architectural changes from GPT-1, only minor modifications: layer normalization is moved to the input of each sub-block, and an additional normalization layer is added after the final self-attention block. The vocabulary is expanded to 50,257 tokens to accommodate a wider range of words, the context size is increased from 512 to 1024 tokens, and a larger batch size of 512 is used.
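A rough sketch of the pre-norm arrangement described above, using standard PyTorch modules; this illustrates the idea (layer normalization applied before each sub-block) rather than OpenAI's actual implementation, and the default sizes are assumed to match the smallest model:

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Transformer block with layer normalization applied before each
    sub-block (as in GPT-2) rather than after (as in GPT-1)."""

    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln_1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln_2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, attn_mask=None):
        # Normalize first, then attend, then add the residual.
        h = self.ln_1(x)
        x = x + self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
        # Same pre-norm pattern for the feed-forward sub-block.
        x = x + self.mlp(self.ln_2(x))
        return x
```

A final normalization layer would then be applied after the last block, as the paper describes.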
Experiments
Four models of increasing size were tested. The smallest is equivalent to GPT-1, and the second smallest is equivalent to the largest BERT model. The learning rate of each model was manually tuned for the best perplexity on a 5% held-out sample of WebText. The paper also notes that none of the models fully fits the WebText training set, underscoring the case for even larger models in future work.
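For reference, the four configurations reported in the paper can be written down as a small lookup (the parameter counts, depths, and widths are taken from the paper; the dictionary itself is just an illustrative way to record them):

```python
# Approximate parameter count -> depth and width for the four GPT-2 sizes.
GPT2_CONFIGS = {
    "117M":  {"n_layer": 12, "d_model": 768},   # roughly GPT-1 sized
    "345M":  {"n_layer": 24, "d_model": 1024},  # roughly BERT-Large sized
    "762M":  {"n_layer": 36, "d_model": 1280},
    "1542M": {"n_layer": 48, "d_model": 1600},  # the model referred to as GPT-2
}
```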
Language Modeling:
GPT-2 demonstrated strong zero-shot domain transfer capabilities, improving state-of-the-art performance on 7 out of 8 datasets in zero-shot settings. Significant improvements were seen on smaller datasets like Penn Treebank and WikiText-2, and on datasets requiring long-term dependencies like LAMBADA and the Children’s Book Test (CBT). Performance on the One Billion Word Benchmark was lower, likely due to dataset size and destructive pre-processing.
Children’s Book Test (CBT):
GPT-2 achieved new state-of-the-art results with 93.3% accuracy on common nouns and 89.1% on named entities. Performance improved with model size and a de-tokenizer to remove tokenization artifacts.
LAMBADA:
GPT-2 reduced perplexity from 99.8 to 8.6 and increased accuracy from 19% to 52.66%. Applying a stop-word filter further increased accuracy to 63.24%, surpassing previous state-of-the-art results.
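The stop-word filter exploits the structure of LAMBADA: the target is always the final word of a sentence and is always a content word, whereas GPT-2's raw errors were mostly valid continuations that were not valid final words. A minimal sketch of the idea, assuming we already have the model's next-token logits and a hypothetical stop_word_token_ids set (a reconstruction of the trick, not the authors' code):

```python
import torch

def filtered_final_word_prediction(logits, stop_word_token_ids):
    # logits: (vocab_size,) next-token scores from the language model.
    # Rule out tokens corresponding to stop words, since the final word
    # of a LAMBADA sentence is always a content word.
    masked = logits.clone()
    masked[list(stop_word_token_ids)] = float("-inf")
    return int(torch.argmax(masked))
```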
Winograd Schema Challenge:
GPT-2 achieved 70.70% accuracy, improving the state of the art by 7%.
The dataset is small (273 examples), so this result should be interpreted with appropriate caution.
Translation:
GPT-2 was tested on the WMT-14 English-French test set, achieving a BLEU score of 5. This score is slightly worse than a word-by-word substitution with a bilingual lexicon from prior work.
Question Answering:
Evaluated on the Natural Questions dataset, GPT-2 correctly answered 4.1% of factoid-style questions.
Summarization:
Summaries were induced by appending the text "TL;DR:" after the article. The performance was lackluster: on ROUGE metrics the generated summaries only begin to approach classic neural baselines and barely outperform selecting three random sentences from the article.
Overall, GPT-2 showed significant advances across multiple tasks, particularly in zero-shot learning and in handling long-term dependencies, though there remains room for improvement on large-scale datasets and on specific tasks such as translation and open-domain question answering.
Conclusion
Anyone who read my previous article or the GPT-1 paper, particularly the “Experiments” section, could have anticipated that GPT-2 would involve a major expansion in model scale. The earlier paper hinted at the potential for multi-task learning in a zero-shot setting, and that expansion in scale is what makes it real here. The beauty of this is that a single model can perform many language tasks without the pain of supervised training for each one. This development marks a significant step toward more versatile and autonomous AI systems.
Next up: GPT-3.