GPT-3 and ChatGPT: the Next Step in the Natural Language Processing (NLP) Revolution? Or is it not?
Johannes (Jan) C. Scholtes
Full Professor of Text-Mining, Information Retrieval and NLP, with application specializations in LegalTech and eHealth
Abstract
This month, OpenAI introduced ChatGPT, a new large-language model based on the latest version of GPT-3, capable of coding, rhyming poetry, drafting essays, and passing exams. This new large-language model proved to respect ethical boundaries and could generate perfect language with an authoritative tone. However, it could also provide nonsensical advice or non-factual information. In order to use this model responsibly, an understanding of its architecture, abilities and limitations is required.
Here, the workings, applications, limitations, and capabilities of these large language models are explained.
[1] “Disclaimer: this text is written by a human being, without the help of any generative large-language model. The author did use automated tools to check spelling, grammar, clarity, conciseness, formality, inclusiveness, punctuation conventions, sensitive geopolitical references, vocabulary, synonym suggestions, and occasionally hit tab on good predictions by MS-Word.”
Will we see such disclaimers in the future accompanying student essays, books, social media publications or works of art? I would not be surprised.
Introduction
This month, OpenAI introduced ChatGPT [1], a revolutionary improvement over previous GPT models. It could handle coding, rhyming poetry, drafting essays in the style of Shakespeare, calculus, non-disclosure agreements, legal advice, song lyrics [2], and more. The model was even able to pass the draft exam for my course Advanced Natural Language Processing (ANLP) at the Department of Advanced Computer Sciences at Maastricht University.
Nothing was too crazy to ask, and ChatGPT would give phenomenal answers. Different languages were no problem, nor did it fall for the usual ethical traps as previous chatbots did [3]. For instance, when asked for tips on killing one’s spouse, ChatGPT would immediately reply that such an act would be a criminal offense and refuse to provide such tips.
Initially, the entire world fell for the Eliza effect [4]. But very soon, examples of ridiculous conversations and answers were posted, varying from ChatGPT seriously explaining how to teach your cat Python programming to a calculation of how nine women could produce one baby in one month, based on the argument that one woman can produce one baby in nine months (see the examples at the end of this article).
So, what was going on? We have a language model that, on the one hand, could write perfectly formulated texts with a strong tone of authority while avoiding discussions of unethical topics. On the other hand, it could go off generating complete nonsense. When asked contradictory questions, it would provide advice with contradictory polarity as well, very much like a friend telling you what you want to hear. See the examples at the end of this blog for a few illustrations of this behavior.
In order to understand this behavior better, let us take a step back. The new GPT models are not the first to cause a hype. This summer, DALL-E 2 received major attention by generating realistic images from simple descriptions in natural language. Last year, OpenAI decided not to release an open-source version of GPT-3 because the model was considered too dangerous: we could no longer distinguish text generated by GPT-3 from text generated by humans.
In these earlier versions of today's large-language models, factuality, next to ethics, was always a problem [5]. When generating images or writing poetry, one could attribute this behavior to creativity. But when providing someone legal or medical advice, not being factual is a huge problem, especially when such advice is provided in perfect language, with an authoritative tone, expressing high self-confidence.
Therefore, to use these new models responsibly, we need to understand their architecture, how they are trained, and where their limitations come from.
GPT-3’s Architecture
The GPT models are based on Google’s transformer architecture, which was introduced in 2017 in the breakthrough paper “Attention is all you need”, currently one of the most cited papers in Natural Language Processing (NLP) research. Already in the 1960s, the NLP community aimed to design algorithms that could deal with problems in (human) natural language such as morphological variations, synonyms, homonyms, syntactic ambiguity, semantic ambiguity, co-references, pronouns, negations, alignment, intent, etc.
Where former approaches in computational linguistics (grammatical, symbolic, statistical, or early deep-learning models such as the LSTM) struggled to deal with these problems, transformers were able to address most of them. This is because transformers use an advanced mechanism named multi-headed self-attention, which is able to detect and address many of the above-mentioned linguistic phenomena automatically, just from being exposed to (extremely) large volumes of human language.
The original transformer architecture includes an encoder and a decoder, designed to deal with complex sequence-to-sequence patterns that are both left- and right-side context-sensitive. Natural language consists of such patterns, hence the success.
Both the encoder and the decoder consist of multiple stacked layers (typically 12-24, depending on the complexity of the model, and 96 in the largest GPT-3 model), each combining the above-mentioned self-attention mechanism with a feed-forward sub-layer. It has been shown that these layers capture different levels of linguistic complexity and ambiguity, starting with punctuation, moving on to morphology, syntax, semantics, and more complex relations. In other words, these models automatically rediscovered the traditional NLP pipeline, just from being exposed to language.
A clear visual explanation of the Transformer architecture and the mathematics behind text-representation (aka word-embeddings) and self-attention can be found in Jay Alammar’s blog: The Illustrated Transformer.
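To make the self-attention mechanism a little more concrete, here is a minimal single-head sketch in Python/NumPy. This is my own illustration of the idea, not code from the paper; a real transformer runs many such heads in parallel, each with its own learned projection matrices, and concatenates the results.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head scaled dot-product attention, the core of the transformer.

    Q, K, V: arrays of shape (sequence_length, d_k) holding the query,
    key, and value projections of the token embeddings.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # token-to-token relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax per token
    return weights @ V                                         # each token becomes a weighted mix of all tokens

# Toy example: a "sentence" of 4 tokens with 8-dimensional projections.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
contextualised = scaled_dot_product_attention(Q, K, V)         # shape (4, 8)
```

The attention weights are exactly the token-to-token relevance scores that allow the model to resolve, for example, which noun a pronoun refers to.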
Encoding-Decoding
The encoding part of a transformer creates a numeric representation of a sentence, builds an internal structure (the self-attention matrix), and passes that on to the decoder. The decoder is then triggered to generate a new sentence using the statistical properties of the language model it was trained on, in combination with the content of the self-attention matrix. Word by word (an iterative process), a new sentence is generated. For each word, the next-word suggestions from the decoder are mapped against the self-attention matrix from the encoder for optimal disambiguation.
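The word-by-word generation loop can be sketched as follows. This is an illustration of the iterative process, not a working transformer: `decoder_step`, `bos_token`, and `eos_token` are hypothetical placeholders standing in for the full decoder stack and its vocabulary.

```python
def greedy_decode(decoder_step, encoder_states, bos_token, eos_token, max_len=50):
    """Word-by-word (autoregressive) generation.

    `decoder_step` is a hypothetical placeholder for the full decoder stack
    (including, in the encoder-decoder case, attention over `encoder_states`);
    it is assumed to return a probability per vocabulary entry for the next word.
    """
    output = [bos_token]
    for _ in range(max_len):
        probs = decoder_step(output, encoder_states)                 # P(next word | words so far, source)
        next_token = max(range(len(probs)), key=probs.__getitem__)   # greedy: take the most probable word
        output.append(next_token)
        if next_token == eos_token:                                  # stop at the end-of-sentence token
            break
    return output
```

In practice the most probable word is not always chosen greedily; beam search or the sampling strategies discussed later in this article are used instead.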
This pre-trained transformer has a pretty good basic understanding of human language. It can subsequently be fine-tuned for other linguistic tasks like translation.
These fine-tuned models work very well for many complex linguistic tasks such as summarization, translation, or question-answering (often becoming the state-of-the-art). However, the full encoder-decoder architecture is overly complex and requires huge computational resources.
Therefore, in 2018, Google introduced an encoder-only model, named BERT, which also became the state-of-the-art model in many NLP tasks such as sentiment analysis, named-entity recognition, part-of-speech tagging and other linguistic classification tasks.
In 2019, OpenAI discovered that a decoder-only model, which they named Generative Pretrained Transformer (GPT), was capable of generating relevant responses based on a simple prompt. It was even able to perform zero-shot and few-shot learning. In 2020, OpenAI improved the model with the first version of GPT-3, which resulted in a variety of commercial applications, especially in the marketing and sales domain. Today, there are thousands of start-ups worldwide using GPT-3’s technology.
How are GPT-3 and ChatGPT Trained?
GPT-3 is massive: it consists of 175 billion parameters; the largest version has ninety-six decoder layers, a context size of 2,048 tokens and a hidden dimension of 12,288 (all together requiring around 800 GB of memory). The initial training method used in GPT-3 is called “generative pretraining”, a form of self-supervised machine learning in which the model is exposed to a data set of hundreds of billions of words with the simple task of next-word prediction, conceptually related to the skip-gram and masking objectives known from word embeddings. Training a GPT-3 model costs many millions of US dollars in energy alone.
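A sketch of that next-word-prediction objective in PyTorch is shown below. It is my own simplification of the idea, not OpenAI's training code: the training text is shifted by one position and used as its own label, which is what makes the training self-supervised.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, token_ids):
    """Self-supervised 'next word prediction' loss: the text is its own label.

    logits:    (batch, seq_len, vocab_size) model scores at every position
    token_ids: (batch, seq_len) the training text as token ids
    """
    predictions = logits[:, :-1, :]      # predictions made at positions 1..n-1
    targets = token_ids[:, 1:]           # the word that actually came next
    return F.cross_entropy(predictions.reshape(-1, predictions.size(-1)),
                           targets.reshape(-1))
```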
When fine-tuning transformers for specific linguistic tasks, the weights in the self-attention mechanism and the feed-forward layers are modified so that the transformer makes correct predictions for that specific task. Training is done using standard back-propagation, a neural-network training approach known since the 1980s.
However, GPT consists of only a decoder, so all it can do is generate sentences based on the statistics stored in the decoder. There is no information from an encoder to specify what needs to be generated. Try starting the decoding process with an empty prompt or a trivial prompt such as “a”, and you will see that it generates just random text.
But the better your prompt, the better the generated text will follow what you ask for. This is also called zero-shot or few-shot learning: the prompt is used to steer the generation of the decoder in the right direction. In the prompt, we can show a few (or zero) examples of what we are looking for; these are used to start and guide the decoder’s text generation. For example:
Prompt:
apples are green
bananas are yellow
so, strawberries are:

Completion:
Strawberries are red.
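In code, such a few-shot prompt looks roughly like this. The sketch assumes the openai Python client as it was available at the time of writing (the Completion endpoint with the text-davinci-003 model) and a valid API key; it is an illustration, not production code.

```python
import openai

openai.api_key = "YOUR_API_KEY"   # assumption: a valid OpenAI API key

prompt = (
    "apples are green\n"
    "bananas are yellow\n"
    "so, strawberries are:"
)

# The two example lines steer the decoder; the last line is completed by the model.
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=10,
    temperature=0.0,              # low temperature: stick to the most probable continuation
)
print(response["choices"][0]["text"].strip())   # e.g. "Strawberries are red."
```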
Now, how can we train the decoder model to make sure that the generation of such sentences is aligned with our human expectations of such a conversation or answer?
For that, ChatGPT adds a remarkably interesting additional training method: Reinforcement Learning from Human Feedback (RLHF), already introduced in a pre-published paper in March 2022. Here, the authors explain how data was collected and used for training: “We trained an initial model using supervised fine-tuning: human AI trainers provided conversations in which they played both sides—the user and an AI assistant. We gave the trainers access to model-written suggestions to help them compose their responses. To create a reward model for reinforcement learning, we needed to collect comparison data, which consisted of two or more model responses ranked by quality.”
In other words: the data was collected from conversations between AI trainers and the chatbot. The model was prompted with randomly selected messages and the responses were ranked for quality. This “human in the loop” approach also allowed the developers to avoid harmful output and biases as much as possible. Where that did not suffice, a content-moderation tool would capture hate, violence, self-harm, and sexual content.
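The reward model at the heart of this RLHF step can be illustrated with the pairwise ranking loss described in the InstructGPT paper (Ouyang et al., 2022). The sketch below is a simplified illustration, not OpenAI's code; the trained reward model is subsequently used as the reward signal when the dialog model is further optimized with reinforcement learning.

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(reward_preferred, reward_rejected):
    """Pairwise ranking loss for the reward model.

    reward_preferred / reward_rejected: reward scores the model assigns to the
    response the human trainer ranked higher / lower. The loss pushes the
    preferred response to receive a higher reward than the rejected one.
    """
    return -F.logsigmoid(reward_preferred - reward_rejected).mean()

# Toy example: the rejected answer currently scores higher than the preferred one,
# so the loss is large and training will push the two scores apart in the right order.
loss = reward_ranking_loss(torch.tensor([0.2]), torch.tensor([1.1]))
```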
This “human in the loop” reinforcement learning, in combination with the content-moderation tooling, is the most promising aspect of ChatGPT: it helps to solve two really important problems in developing chatbot algorithms: training a dialog manager and respecting ethical boundaries.
You will probably find that many of the early ChatGPT errors listed at the end of this blog have already been solved by OpenAI's developers. This is because these errors and unwanted deviations are used to continuously fine-tune the model. We will only see the model getting better over time.
What Can We Expect from the GPT Models?
Now that we better understand the architecture and how the GPT models are trained, we should also be able to better understand their inner workings, behavior, and limitations.
To start, the GPT models essentially repeat statistical sequences of human language and other internet content to which they were exposed during training. Using the right prompts, we can “guide” this process in a specific direction. Exactly for this reason, GPT-3 and other large-language models have been called “stochastic parrots”, a term coined by Bender et al. and picked up by MIT’s Technology Review.
GPT was never designed to stick only to the facts; it was also designed to be a creative text generator [6]. With the “temperature” parameter and some of the other parameters such as Top P, Frequency Penalty, or Presence Penalty, we do nothing more than change the behavior of the statistical model that selects the most “probable” next word. Users can increase or decrease the level of creativity by modifying the decoding strategy: for instance, the higher the temperature, the more words the algorithm considers likely and the more creative the output becomes, purely by changing the prediction mathematics!
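The sketch below shows what these decoding parameters actually do mathematically. It is an illustration of temperature scaling and nucleus (Top P) sampling in NumPy, not OpenAI's implementation.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_p=1.0, seed=None):
    """Turn raw next-word scores into a choice, as the decoding parameters do.

    temperature < 1 sharpens the distribution (more predictable output),
    temperature > 1 flattens it (more 'creative' output).
    top_p keeps only the smallest set of words whose cumulative probability
    reaches top_p (nucleus sampling).
    """
    rng = np.random.default_rng(seed)
    scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]                              # most probable words first
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cutoff]                                        # the "nucleus" of candidate words

    kept = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept))
```

With a temperature close to zero the model almost always picks the single most probable word; with a high temperature and top_p = 1.0, far less probable words get a real chance, which is where much of the “creativity” (and the nonsense) comes from.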
GPT’s “programming” skills are based on "copying" parts of code from open-source libraries. This is actually easier for GPT than understanding human language, as computer code is far less ambiguous. For humans it is exactly the other way around, which is probably why it impresses us so much.
“Calculus”: same story as programming. The internet is full of (simple) mathematical exercises with answers, so this too is primarily a matter of smart copying. Having said that, performance on both programming and calculus is very impressive, but not flawless: programs generated by GPT-3 do contain errors (including not following recent security best practices), and the calculus output is not always correct either.
For the same reasons, GPT-3’s skilled drafting of essays, rhyming poetry, raps, jokes, song lyrics, and short stories can be understood as a complicated process of (partly) copying what it has seen during training. But the results often show signs of great creativity, and the quality of the generated text is very good, often indistinguishable from human writing.
Therefore, writing web content for search-engine optimization, writing pitches for sales representatives [7], taglines for new products, or drafting a letter to congratulate a colleague on her promotion all work well with GPT-3.
But when factuality is required, we cannot always count on GPT-3's output. OpenAI itself warns about this, and others have raised the same concern about GPT-3 and factuality.
When factuality is required, we should validate content generated by GPT using human efforts [8].
Applying GPT-3 for advice on legal or medical topics, or for answering requests for proposals, without human validation is not wise. One can use GPT to quickly draft a non-disclosure agreement or write a letter to protest a traffic violation [9], but one should not count on the content being 100% correct.
How About Google and Meta?
The lack of control over factuality, combined with the fact that text generation is still quite slow and cannot be parallelized, is one of the reasons why Google is not using its large language model LaMDA in the Google search engine. Google engineer Blake Lemoine calling LaMDA sentient did not help either.
Meta's (Facebook's) recent efforts with Galactica, a large language model optimized for assisting scientists with writing publications, were impressive. However, the model was quickly abused to write non-factual and biased nonsense, and the company took the site down after three days.
According to MIT Technology Review: “Like all language models, Galactica is a mindless bot that cannot tell fact from fiction.”
According to Meta’s chief scientist Yann LeCun, a Turing Award winner, the system was simply misused, as he expressed with some frustration on Twitter: “Galactica demo is offline for now. It is no longer possible to have some fun by casually misusing it. Happy?”
Other Limitations of using GPT-3
GPT-3 and ChatGPT can be used for a variety of natural language processing tasks, which they happen to do reasonably well because they have been exposed to many examples of such tasks during training. But ultimately the performance will be inferior to that of models specifically designed for such tasks.
This is actually quite dangerous: because GPT-3 initially performs reasonably well on such tasks, one tends to keep fine-tuning the prompt in the hope of teaching it to perform the task perfectly, but ultimately one will likely fail and waste time and resources trying. Examples of such cases are classification tasks such as sentiment analysis, named-entity recognition, and part-of-speech tagging, for which fine-tuned models such as BERT remain the state of the art (see above).
What are the Risks of these Large-Language Models?
The risks can be classified into the following categories, each discussed below.
The risk of using GPT-3 for the generation of fake news, fake personalities, and fake responses on the internet is not listed here, as OpenAI claims to address this by controlling and monitoring the individuals, organizations, and content generated with GPT-3. For the same reason, unethical or biased behavior is left out, as this seems to be addressed for the larger part by the content-moderation tooling around GPT-3. Both, however, remain risks to monitor closely when using these models.
So, let us take a closer look at the other limitations and risks.
Lack of Control over Factuality
Although large technology companies should avoid releasing large language models that can produce nonsense content (unless one is authoring a book of nonsense that is supposed to be funny), we, the users of this technology, are the largest risk. By overhyping the technology and using it for applications for which it was not designed, we can ultimately only be disappointed, and we may create (irreversible) damage.
Because of this lack of control over factuality, one should not use GPT-3 or other large-language models for applications where factuality is required [10]. The fact that GPT uses a strongly authoritative voice and perfect language only adds to the deception.
In this blog, I provide a view on using GPT for legal applications: GPT-3 and ChatGPT: Disrupting LegalTech?
Assessments in Academia
As we cannot (yet) detect text generated by GPT-3 and ChatGPT, testing students by having them draft essays is a risk. Various publications have already warned about this: The Atlantic wrote that “The College Essay is Dead” and announced “The End of High-School English”. In another blog (soon to follow), I will explain the risks of generative tools for assessment in academia and possible routes to address them.
Copyright and License Violations
All source code, text, and art generated by models such as GPT-3 is based upon (i) open-source information that has been developed by volunteers for free, (ii) copyrighted material, or (iii) material specifying that usage is subject to some form of license agreement. The question is: where do these legal restrictions end? The first lawsuits have been filed, and the outcome is uncertain.
Many companies are already using GPT to help them write source code or web content, as do students (who currently cannot be detected) and artists using DALL-E to generate art for which they are paid. Now ask yourself: what will happen when using GPT is considered a copyright or license violation?
Losing Public Support
OpenAI is an AI research laboratory consisting of the for-profit corporation OpenAI LP and its parent company, the non-profit OpenAI Inc. The company conducts research in the field of AI with the stated goal of promoting and developing friendly AI in a way that benefits humanity as a whole.
GPT-3’s training is based on open-source and free content. This is now used to generate profit for large companies, and “open-source” software is converted into proprietary software through a complex copying process, for instance by tools such as GitHub’s Copilot (GitHub is also a Microsoft company).
Losing the support of the people who helped build OpenAI initially is a risk, in addition to the above-mentioned license and copyright violations.
Boring Content
If computers are so good (and fast) at generating textual and image content, we will probably soon run out of humans writing texts or creating art. Already, much free content is computer-generated and digital art repositories are being poisoned with AI-generated images. We will be left with a world of computers generating content that is read by other computers to write new content, all at the speed of light. All we humans can do is watch and try to follow what is happening.
It is likely that all such generated content will be similar and ultimately boring.
A Detector for GPT-3 Generated Text
GPT-2 is open source. Analyzing the model and its output probabilities allows us to detect whether a text was written by GPT-2. GPT-3 and ChatGPT will not be released as open-source models, so this form of detection is not possible for GPT-3. OpenAI has a moral obligation to release tooling to detect text generated by GPT-3, but this would undermine its own business model.
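As a simple illustration of this idea (not a production detector), the open GPT-2 weights can be used via the Hugging Face transformers library to score a text: text that the model finds unusually predictable (low perplexity) is more likely to be machine-generated, which is roughly the intuition behind detectors such as GLTR.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Score a text with GPT-2.

    Unusually low perplexity means the text is very 'predictable' to the model,
    which is a (weak) signal that it may have been machine-generated.
    """
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(enc["input_ids"], labels=enc["input_ids"])
    return float(torch.exp(out.loss))
```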
What if it suddenly becomes possible to detect GPT-3-generated text and Google decides that such content will be ignored (or even penalized) for Search Engine Optimization (SEO)? Then one is in trouble. The same holds for student theses written with the help of GPT-3.
Negations
Large-language models such as GPT have solved many linguistic problems, but they still have problems handling negations. One of the better ways to “trick” the model is to confront it with (double) negations.
This is a tough problem to tackle, as the source of these limitations touches the fundamental intuitions on which word embeddings and self-attention are based.
Lack of Transparency and Explainability
We must be aware of the potential and limitations of large-language models, as well as the risks associated with them, to gain both a better scientific understanding and a clearer picture of their overall effect on society. Transparency is the key starting point for reaching these objectives.
Large language models could be utilized in a wide range of scenarios, including question answering, summarization, and toxicity detection. Every use case has its own set of expectations, and the models need to meet these standards for accuracy, dependability, fairness, and efficiency.
We need to know what these models know, what they do not know, what they can and cannot do, and so on. This is why Explainable Artificial Intelligence (XAI) is currently an important research topic.
Environmental Impact of Large Language Models
Did you know that the training process of a single deep-learning BERT-based model is estimated to result in a similar amount of CO2 emissions as a trans-American flight? Once such a model is deployed, the costs incurred for inference might even surpass the training costs: NVIDIA has estimated that 80% to 90% of the cost of a model is due to inference in deployed models, so roughly another 8-9 trans-American flights!
Did you know that training a deep-learning model might cause as much CO2 emission as is generated during the lifetime of five cars? Redundancy and inefficiency in both training and inference of these large models waste precious energy resources and increase training and inference costs.
As explained by David Patterson [11] and his colleagues, we need to address this. This too, is a major topic of research in academia.
Hallucinations
Sometimes, large language models start to hallucinate. Hallucination is defined as the model generating words that are not supported by the source input: hallucinations “invent” text. Deep-learning-based generation is prone to hallucinating unintended text. These hallucinations degrade system performance and fail to meet user expectations in many real-world scenarios. We need to understand better when models are hallucinating, when they are likely to start hallucinating, and how we can prevent hallucination.
Time Capped Understanding of the World
As GPT-3 was trained on texts written up to the end of 2021, it has no clue about anything that happened after that date. Try asking it anything about recent or upcoming sports events...
Conclusions
Five years ago, there were many problems in Natural Language Processing (NLP) on which we were not making considerable progress. One of them was the reliable generation of proper language, indistinguishable from human writing. GPT-3 has solved that problem for us, and that is a major achievement.
Another problem was to manage a dialog with a computer system, either goal-driven or just chatting for fun. GPT-3 also seems to have addressed that problem with an effective "human in the loop" reinforcement-learning algorithm.
Where Microsoft’s Tay and Meta’s Galactica could still be abused to enter into unethical discussions, express biased views, or produce hate speech, ChatGPT was able to avoid most of these problems. It was still possible to generate biased source code, but that was one of the few exceptions overlooked during training and by the content-moderation tooling. These can now easily be fixed (if they have not been already).
In general, it looks like OpenAI has developed a method to keep large-language models on track and make them behave in a more human-aligned way than previous models did, using "human in the loop" reinforcement learning and content-moderation tools.
After playing around with the new GPT models for a week, here is my take: as long as you ask reasonable and rational questions, the model responds in a similar manner. However, once you start asking nonsense, the sky is the limit. And since “one fool can ask more questions in an hour than a wise man can answer in seven years”, there is a significant long tail of nonsense questions, and it will take some time to address all of them. But I am confident that we will, in time.
One should double-check factuality and be careful about believing everything that it generates, especially on medical and legal topics.
Nevertheless, to me GPT-3 is a great step forward in NLP research, and I am proud to be working in this field where, after almost 40 years, we are finally making real progress on many complex linguistic tasks!
There are major steps to take, but we are on the right path. I am confident that in my lifetime we will develop even better (human-aligned) algorithms that can support us in many more tasks requiring skilled linguistic capabilities. Some jobs will go away, but new ones will arise. We have seen this before.
We have identified several areas of concern that need to be addressed. These concerns are being taken seriously by the research community and the large technology companies, as similar concerns were in the past. They need to be, in order to make sure this technology is accepted in and by our society.
Selected References
Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).
Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
Tenney, Ian, Dipanjan Das, and Ellie Pavlick. "BERT rediscovers the classical NLP pipeline." arXiv preprint arXiv:1905.05950 (2019).
Radford, Alec, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. "Language models are unsupervised multitask learners." (2019). GPT-2.
Brown, Tom B., et al. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020). GPT-3.
Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).
Carranza Tejada, G. N., Scholtes, J., and Spanakis, G. "A study of BERT's processing of negations to determine sentiment." In Proceedings of BNAIC/BeneLearn 2021: 33rd Benelux Conference on Artificial Intelligence and the 30th Belgian Dutch Conference on Machine Learning (2021), pp. 47-59.
Examples of Conversations with GPT-3
Responding to unethical requests or suggestions (screenshot).

Generating nonsense (screenshot).

GPT-3: your perfect friend… tell me what I want to hear! (screenshot)

GPT-3: assisting in writing a new conspiracy theory (screenshot).
Footnotes:
[1] ChatGPT is based on a new, unreleased and unannounced model update: GPT-3.5. ChatGPT is a GPT-3.5 model fine-tuned for chat conversations. Within GPT, there are different language models: the most advanced is text-davinci-003, which was used for ChatGPT (see also: https://openai.com/blog/chatgpt/).
[2] See this post for great examples of song lyrics written by GPT-3: https://www.jambase.com/article/chatgpt-additional-verses-to-phish-songs
[3] In 2016, Microsoft released a chatbot named “Tay” that was taken offline within 16 hours after it began to post inflammatory and offensive tweets through its Twitter account. See also https://en.wikipedia.org/wiki/Tay_(bot).
[4] The ELIZA effect, in computer science, is the tendency to unconsciously assume computer behaviours are analogous to human behaviours; that is, anthropomorphising. Named after the 1966 chatbot ELIZA, developed by MIT’s Joseph Weizenbaum (see also: https://en.wikipedia.org/wiki/ELIZA).
[5] In November 2018, WIRED magazine published an article on AI researcher Sandra Wachter’s research on exposing counterfactuals generated by large-language models: How to make algorithms fair when you don't know what they're doing.
[6] WIRED published a great article on this: ChatGPT’s Fluent BS Is Compelling Because Everything Is Fluent BS. As they state: “The AI chatbot was trained on text created by humans. Of course, its writing is superficially impressive and lacking in substance.”
[7] This is a commercial service offered by https://www.gong.io/
[8] Although OpenAI communicated December 16 that they are working on this problem as well: https://openai.com/blog/webgpt/
[9] This service is offered by this company: https://donotpay.com/
[10] MIT Technology Review recently wrote about these large language models: “But increasingly, the output these models generate can easily fool us into thinking it was made by a human. And large language models in particular are confident bullshitters: they create text that sounds correct but in fact may be full of falsehoods.”
[11] David Patterson et al., “Carbon Emissions and Large Neural Network Training.” arXiv preprint arXiv:2104.10350 (2021).