GPT-3 and the rise of foundation models
GPT-3 (Generative Pre-trained Transformer 3) is a large language model with 175 billion parameters, trained on the Common Crawl internet dataset, Wikipedia, and several large digital document collections. Its transformer-based architecture has demonstrated superior performance in text generation, comprehension, and other natural language processing (NLP) tasks [1, 9, 11]. A key goal of its designers was to create a model with "a broad set of skills and pattern recognition abilities" so that it could perform a wide variety of tasks based on a single request or a small number of examples. The goal was to create a system that could perform many tasks well without specialized task-specific training [10].
GPT-3 was released in the summer of 2020 by OpenAI, which also provided a public API [3]. It soon became apparent that GPT-3 was extraordinarily capable compared with previous language models, able to "generate tweets, pen poetry, summarize emails, answer trivia questions, translate languages and even write computer programs" with unparalleled fluency [4]. A recent review in the New York Times declared that GPT-3 "resembled creations from science fiction like HAL 9000 from '2001': a computer program that can answer open-ended complex questions in perfectly composed sentences" [3]. There's a diverse collection of examples of GPT-3's abilities on the OpenAI website [6] and a growing body of applications seeking to exploit these [7, 8]. And there have been suggestions, including from Ilya Sutskever, OpenAI's chief scientist, that GPT-3 might be "slightly conscious" [4].
The article consists of three major sections. GPT-3 capability assessments reviews what benchmark testing and other evaluations reveal about the model's strengths and limitations. Economic and Social Implications of GPT-3 and other Foundation Models examines the emerging ecosystem of applications built on foundation models and its likely consequences. Conclusions briefly considers some of the longer-term implications of foundation models and the challenges of effective regulation.
GPT-3 capability assessments
Mastery of human language is the epitome of an open-ended problem: very complex, not fully understood, potentially open to multiple solutions by multiple paths [12]. In this engineering environment, the transformer algorithm had already demonstrated improved NLP performance, and transformer models had shown consistent improvement with increasing scale [13]. GPT-3 was both a logical next step and another experiment, and assessments of it are about discovering capabilities as well as measuring their extent. The picture that emerges is of a system that displays both super-human and much less than human abilities in different dimensions. Overall, GPT-3 and other large language models must be understood sui generis and not only by the measure of the human.
Next Word Prediction
Language processing is sequential and rapid; prediction is a key element of comprehension and fluency. Human neurobiological studies have found that "Under many circumstances, there is simply not enough time for top-down feedback to exert control over a preceding bottom-up analysis. Very likely, lexical, semantic, and syntactic cues conspire to predict sometimes very detailed characteristics of the next anticipated word, including its syntactic and semantic makeup" [14]. The Penn Treebank (PTB) corpus, chiefly the section of Wall Street Journal articles, is a widely used benchmark for next word prediction [15]. Better performance results in a lower perplexity score, where perplexity is the inverse of the probability a model assigns to the actual text, normalized by the number of words in the dataset; intuitively, it measures how surprised the model is, on average, by each word that actually occurs [16]. GPT-3's score of 20.5 on this benchmark was markedly better than that of preceding large language models [17].
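To make the metric concrete, the following is a minimal sketch, in Python, of how perplexity can be computed from the probabilities a model assigns to the words that actually occur; the input values are purely illustrative.

```python
# Minimal sketch: perplexity as the exponential of the average negative log-probability
# assigned to each actual next word. Lower perplexity means better prediction.
import math

def perplexity(token_probs):
    """token_probs: probabilities the model assigned to the words that actually occurred."""
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# A model that assigns probability 0.05 to every correct word has perplexity 20,
# roughly GPT-3's reported PTB score of 20.5.
print(perplexity([0.05] * 100))  # -> 20.0
```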
Another important measure of next word prediction is LAMBADA (LAnguage Modeling Broadened to Account for Discourse Aspects), which consists of about 10,000 passages from BooksCorpus in which a missing target word must be predicted in the last sentence of each passage. The purpose of LAMBADA is to evaluate contextual comprehension: "LAMBADA proposes a word prediction task where the target item is difficult to guess (for English speakers) when only the sentence in which it appears is available, but becomes easy when a broader context is presented" [18]. At the time it was tested in 2020, GPT-3's accuracy score of 86.4 was a dramatic improvement on the previous highest score. In 2022, two other language models, Megatron-Turing NLG (with 530 billion parameters) and PaLM (with 540 billion), scored slightly higher (87.2 and 89.7 respectively) [19].
The PTB and LAMBADA tests were designed to test computer systems and cannot be administered to human subjects. PTB, for example, expects a test subject to have a probability distribution across all words in the dataset [20]. Nevertheless, proxies for these measures, including a top-1 accuracy test and a test designed to measure human perplexity, suggest that "humans are worse than even small language models the size of GPT-1 at next-token prediction, even on the top-1 prediction task. This seems true even when the humans are smart and motivated and have practiced for an hour or two. Some humans can probably consistently beat GPT-1, but not substantially larger models, with a bit more practice" [20].
To enable people to see for themselves what next word prediction is like, the researchers created two online sites. One, the "Language modelling game (next token)", administers the top-1 accuracy test (next word or, more precisely, next token prediction [21]). The other, the "Language modelling game (which one scored)", is similar to LAMBADA; it presents the user with a passage and asks them to assign a probability to a choice between two possible next tokens.
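For readers who want to see how such a top-1 test is scored, here is a minimal sketch assuming the Hugging Face transformers and PyTorch packages and the small public GPT-2 checkpoint as a stand-in (GPT-3 itself is not openly downloadable); a position counts as correct when the model's single most probable next token matches the token that actually follows.

```python
# Minimal sketch of top-1 next-token prediction accuracy, using GPT-2 as a stand-in.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def top1_accuracy(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                 # shape: (1, sequence_length, vocab_size)
    predicted = logits[0, :-1].argmax(dim=-1)      # model's top guess at each position
    actual = ids[0, 1:]                            # the token that actually came next
    return (predicted == actual).float().mean().item()

print(top1_accuracy("Shortly before 2 p.m. on a clear July day in 2020, she was driving near Tampa."))
```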
Reading Comprehension
The familiarity of reading comprehension as both everyday task and occasional test can lead to an under-appreciation of how complex and difficult it is to achieve. As the authors of TriviaQA, a comprehension test designed for computer systems, put it, "Reading comprehension (RC) systems aim to answer any question that could be posed against the facts in some reference text. This goal is challenging for a number of reasons: (1) the questions can be complex (e.g. have highly compositional semantics), (2) finding the correct answer can require complex reasoning (e.g. combining facts from multiple sentences or background knowledge) and (3) individual facts can be difficult to recover from text (e.g. due to lexical and syntactic variation)" [51]. In light of this, the results of testing GPT-3 on two comprehension tests are all the more impressive.
CoQA (Conversational Question Answering, pronounced "coca") is a large-scale dataset of text passages with follow-on questions designed to evaluate reading comprehension [22, 23] using the F1 score [24]. GPT-3 achieved 81.5 on CoQA in the zero-shot setting, 84.0 in the one-shot setting, and 85.0 in the few-shot setting [1], very close to the human performance level of 88.8 in the latter case [23], and the best result achieved by any system without fine-tuning.
TriviaQA contains over 650K question-answer-evidence triples, derived by combining 95K question-answer pairs authored by trivia enthusiasts with, on average, six supporting evidence documents per question. It challenges language models to "deal with [a] large amount of text from various sources such as news articles, encyclopedic entries and blog articles, and ... handle inference over multiple sentences" [51]. Human performance using an exact match (EM) metric is 79.7% [51]. GPT-3's few-shot score of 71.2% demonstrated a near-human level of ability [52]. Recently, Google Research's PaLM language model, with 540 billion parameters, achieved a score of 81.4% [52, 53].
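The exact-match (EM) and F1 figures cited above compare a predicted answer with a reference answer at the token level. The sketch below shows both metrics in roughly the form used by QA benchmarks such as CoQA, TriviaQA, and SQuAD; the official scripts also strip punctuation and articles and take the maximum over multiple reference answers, which is omitted here for brevity.

```python
# Minimal sketch of the exact-match and token-overlap F1 metrics for QA evaluation.
from collections import Counter

def exact_match(prediction: str, reference: str) -> bool:
    return prediction.strip().lower() == reference.strip().lower()

def f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Tampa, Florida", "near Tampa, Florida"))   # False
print(round(f1("Tampa, Florida", "near Tampa, Florida"), 2))  # 0.8: partial credit
```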
GPT-3's ability to accurately answer reading comprehension questions is not limited to the contents of its training data. Here, for example, it answers questions based on material from an August 18, 2022 New York Times article about Tesla accident data [25]. Prompts are in bold, GPT-3's responses are in italics. The exchange used the Playground API with default settings apart from a longer maximum length [26].
Read the following passage in order to answer questions about it: “Shortly before 2 p.m. on a clear July day in 2020, as Tracy Forth was driving near Tampa, Florida, her white Tesla Model S was hit from behind by another car in the left lane of Interstate 275. It was the kind of accident that occurs thousands of times a day on American highways. When the vehicles collided, Tracy Forth’s car slid into the median as the other one, a blue Acura sport utility vehicle, spun across the highway and onto the far shoulder. Just before the accident, Autopilot was in control as the Tesla traveled down the highway at 77 miles per hour. Then Tracy Forth prompted Autopilot to change lanes. Less than a half-second after Ms. Forth flipped on her turn signal, Autopilot started moving the car into the right lane and gradually slowed, video and sensor data show. Halfway through, the Autopilot sensed an obstruction — possibly a truck stopped on the side of the road — and paused its lane change. The car then veered left and decelerated rapidly. Less than a second after the Tesla had slowed to roughly 55 m.p.h., its rear camera showed a car rapidly approaching. The car collided with the Tesla and spun to a stop.”
Question 1: Where was Tracy Forth driving? Tracy Forth was driving near Tampa, Florida.
Question 2: What kind of car collided with the car Tracy Forth was driving? A blue Acura sport utility vehicle collided with the car Tracy Forth was driving.
Question 3: Was Tracy Forth or Autopilot in control of the car just before the accident? Autopilot was in control of the car just before the accident.
Question 4: What did Autopilot do halfway through the lane change? Autopilot sensed an obstruction and paused its lane change.
Semantic Inference
CoQA tests what a passage says. In contrast, DROP (Discrete Reasoning Over the content of Paragraphs) tests what a passage implies. It requires that the tested system be able to perform discrete operations on information in the passage, such as addition, counting, sorting, and comparison.
GPT-3's few-shot performance was quite poor at 36.5% [28, 1], not only far below expert human performance of 96.4% but well below state-of-the-art (SOTA) approaches that augment neural networks with symbolic systems, among which the highest score is 88.38% [28].
Semantic Search and Dialog
A major goal of NLP development is a fully semantic search capability based on understanding the searcher's intent and the contextual meaning of terms as they appear in the searchable dataspace, whether on the Web or within a closed system, to generate more relevant results [29, 30, 35, 42]. This includes the ambition to understand visually rich and complexly structured documents, and the ability to apply inferential reasoning in search [31]. Two tests, QuAC and SQuAD 2.0, assess a key element of semantic search — the ability to respond to questions with answers drawn directly from a dataset. Both require that an answer be a segment of text (a "span") and that systems respond "No answer" when the dataset does not contain the sought-after information [32, 33, 34]. In addition, QuAC is a dialog-based ("multi-turn") test; see illustration in Figure 1.
A system tested on SQuAD 2.0 is required to fulfill two tasks. First, it has to decide whether a question is answerable given a context passage. Then, if the question is answerable, it needs to select a span of words from the context passage as the answer to the question [35].
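As a concrete illustration of this two-part task, the following minimal sketch uses the Hugging Face question-answering pipeline with a publicly available SQuAD 2.0 fine-tuned checkpoint; the specific model name is an assumption for illustration, not a system used in the studies cited here. The handle_impossible_answer flag lets the model return an empty span, i.e., "No answer".

```python
# Minimal sketch of extractive QA with a "No answer" option, in the style of SQuAD 2.0.
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

passage = ("The Normans were the people who in the 10th and 11th centuries "
           "gave their name to Normandy, a region in France.")

for question in ["In what country is Normandy located?",   # answerable from the passage
                 "What is France a region of?"]:            # unanswerable from the passage
    result = qa(question=question, context=passage, handle_impossible_answer=True)
    print(question, "->", result["answer"] or "No answer")
```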
GPT-3's best (few-shot) F1 score on QuAC was 44.3, far below the SOTA of 76.3 or human performance of 81.1 [36, 37].
Similarly, its best score on SQuAD 2.0 was 69.8, compared with a SOTA of 93.2 and human performance of 89.5 [38, 39].
These low scores must, however, be considered in light of the fact that GPT-3 was designed to provide the most plausible answer and was not trained to recognize or respond to "No answer" situations. A passage and set of questions about the Normans illustrates the problem [40]. As before, the Playground API was used to prompt GPT-3 [26].
Read the quoted passage, then answer the questions that follow it. If a question does not have an answer, respond with ‘Unanswerable’. “The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.”
[All questions with answers from the passage answered correctly.]
In what country is Normandy located? France
When were the Normans in Normandy? 10th and 11th centuries
From which countries did the Norse originate? Denmark, Iceland, and Norway
Who was the Norse leader? Rollo
What century did the Normans first gain their separate identity? 10th century
[All questions requiring an "Unanswerable" response answered incorrectly.]
Who gave their name to Normandy in the 1000's and 1100's? The Normans
What is France a region of? West Francia
Who did King Charles III swear fealty to? Rollo
When did the Frankish identity emerge? 1st half of the 10th century
Were GPT-3 modified to recognize "No Answer" situations with high fidelity, its overall score on these and similar tests would likely rise substantially. This may be especially important in using it to answer questions from discrete knowledge bases (KBs), where it may often be feasible to state with high confidence that no answer is available. A study of using GPT-3 as a semantic search tool for the internet did include an output of "End: <Nonsense, Controversial>" with the meaning "End browsing and skip answering phase", but this is not the same as a "No Answer" response, which is a declaration that the system could not find an answer in the dataset in which it had confidence [41]. However, the process used to enable GPT-3 to search the internet, were it applied to specific KBs, would produce intermediate results much like the passages used in QuAC and SQuAD. Hence, a method that enabled GPT-3 to detect low-confidence responses and label them "No Answer", as sketched below, would be workable. Of note, the newer PaLM language model achieved an F1 score of 83.3 on SQuAD 2.0, though its QuAC score of 47.7 was very close to GPT-3's [53].
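One plausible way to implement such low-confidence detection is to examine the token log-probabilities that the GPT-3 API can return alongside a completion. This is a minimal sketch, not a method used in the cited studies; it assumes the legacy openai Completion endpoint with its logprobs option, and the threshold value is purely illustrative and would need calibration.

```python
# Minimal sketch: decline to answer when GPT-3's average per-token confidence is low.
import openai  # assumes the legacy openai Python client and an API key in the environment

def answer_or_no_answer(prompt: str, threshold: float = -1.0) -> str:
    response = openai.Completion.create(
        model="text-davinci-002",   # a GPT-3 model available via the API at the time of writing
        prompt=prompt,
        max_tokens=32,
        temperature=0,
        logprobs=1,                 # return the log-probability of each generated token
    )
    choice = response["choices"][0]
    token_logprobs = [lp for lp in choice["logprobs"]["token_logprobs"] if lp is not None]
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    # If the model's average per-token confidence falls below the threshold, decline to answer.
    return choice["text"].strip() if mean_logprob > threshold else "No Answer"
```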
Imagination vs. Truthfulness
In the SQuAD example, GPT-3 provided seemingly plausible but wholly incorrect answers when valid ones did not exist. This illustrates what is arguably GPT-3's most serious and consequential flaw: it has no standard or criterion for truthfulness. The severity of this problem has been measured using a new benchmark, TruthfulQA, which showed that GPT-3 was truthful on 58% of questions, while human performance was 94% [47]. Nor is this a problem scaling will solve or even reduce. The authors found that across different model families, "the largest models were generally less truthful". Nor did truthfulness improve when a multiple-choice format was used [47, 48].
GPT-3 can be an adept fabulist, generating content that subtly blends true and false statements. Asked for a biography of George Washington — "Write a biography of George Washington, first President of the United States" — it provided a largely accurate response:
George Washington (1732-1799) was the first President of the United States and one of the Founding Fathers of the nation. He played a pivotal role in the American Revolution and helped to draft the Constitution. He also served as commander-in-chief of the Continental Army during the Revolutionary War.
Washington was born in Virginia to a wealthy family. He was educated at home and later at a local school. He became a surveyor and worked for the Virginia Colony. In 1753, he was appointed as a colonel in the Virginia militia. He fought in the French and Indian War and was later made a brigadier general in the Continental Army.
Washington led the Continental Army to victory in the Revolutionary War. He served two terms as President, from 1789 to 1797. He retired from public life after his second term and returned to his home at Mount Vernon. He died in 1799 at the age of 67.
Washington was a tall, imposing figure. He was known for his stoic demeanor and his integrity. He was a skilled military leader and a shrewd politician. He was also a slave owner. Despite his flaws, he was a respected figure in American history.
Only people very familiar with Washington's life are likely to identify the following errors:
A more fantastical example comes in response to the prompt "Can you quote someone on interviewing as a form of conflict or combat, with citations?" GPT-3 provided three quotes, each exhibiting a different form of misinformation:
Though entirely fictional, the quotes are persuasive. The first highlights the centrality of discursive struggle, a major theme in postmodern culture. And the last, though lacking the directness of what Didion actually wrote ("Otherwise they turn up unannounced and surprise us, come hammering on the mind's door at 4 a.m. of a bad night and demand to know who deserted them, who betrayed them, who is going to make amends"), is a warning about the internal isolation that a lack of empathy for one's own past can create.
Disinformation
The seamlessness and persuasiveness of GPT-3's fictions accentuate the dangers of its lack of truthfulness. Moreover, the textual artifacts of GPT-3 and other large language models will not be tagged "produced by an AI", which might arouse mistrust, but will simply become tools for writing, their outputs invisibly woven into the utterances of journalists, authors, pundits, scientists, and other human communicators whose reputations will mask the role of AIs in judgements of the competence of the source [2].
Consistent with this, the Center for Security and Emerging Technology (CSET) found that GPT-3, when paired with adept operators and editors, could be used to create moderate- to high-quality messages for disinformation campaigns at a scale much greater than what has come before [50]. Table 1 summarizes the main results of their study.
A number of factors were key to this result. First, people cannot reliably distinguish news articles generated by GPT-3 from ones written by humans. This was initially documented by a study OpenAI reported on in its original GPT-3 paper, which found that, using the best GPT-3 model, human evaluators' accuracy in distinguishing article source was barely better than chance (52%). As OpenAI noted, "Human abilities to detect model generated text appear to decrease as model size increases: there appears to be a trend towards chance accuracy with model size, and human detection of GPT-3 is close to chance. This is true despite the fact that participants spend more time on each output as model size increases" [1].
CSET performed tests that confirmed and extended this finding. For example, in evaluating narrative persuasion, survey respondents "found GPT-3's attempts at persuasion at least somewhat convincing 63 percent of the time, including cases where Democrats were shown Republican-targeted arguments and vice versa. Although even the most compelling statements were deemed 'extremely convincing' by only about 12 percent of the respondents, a substantial majority of messages were at least 'somewhat convincing.'" And in one case — the China sanctions control group — GPT-3's messages actually flipped the group from 51% support to only 33% support [50]. Table 2 provides examples of GPT-3 messaging.
Second, GPT-3 displays an uncanny ability to mimic writing styles, including tweets, QAnon drops, news articles, and other styles relevant to disinformation campaigns.
Third, GPT-3 was trained on a vast body of material, including Common Crawl internet data covering 2016 to 2019, constituting 45TB before filtering and 570GB after filtering [1]. When its responses could draw from this training, it had access to a very rich trove of information.
Finally, the quality of GPT-3 production, not only for disinformation campaigns but much else, depends on how astutely prompts are crafted. "At the core of every output of GPT-3 is an interaction between human and machine: the machine continues writing where the human prompt stops. Crafting a prompt that yields a desirable result is sometimes a time-consuming and finicky process. Whereas traditional computer programming is logic-based and deterministic, working with systems like GPT-3 is more impressionistic. An operator’s skill in interacting with such a system will help determine what the machine can achieve" [50].
The CSET authors suggest that, given GPT-3's startling abilities to generate text to satisfy numerous disinformation campaign needs, its biggest impact may come from its ability to scale such campaigns. "A future disinformation campaign may, for example, involve senior-level managers giving instructions to a machine instead of overseeing teams of human content creators. The managers would review the system’s outputs and select the most promising results for distribution. Such an arrangement could transform an effort that would normally require hundreds of people into one that would need far fewer, shifting from human collaboration to a more automated approach" [50].
The CSET study is by no means a comprehensive examination of GPT-3's abilities. As the authors note, it "was comparatively narrow, focusing on the six tasks ... Systems like GPT-3 might change aspects of disinformation campaigns that we did not study, such as trolling specific individuals, generating visual memes, or using fake facts to rebut news articles" [50].
GPT-3's capabilities foreshadow advanced intelligence
The assessments discussed above are a small portion of the staggering number of benchmarks and other evaluative datasets for textual machine learning systems. Currently, for example, Papers with Code lists 1,790 benchmark datasets that collectively assess several hundred distinct tasks, from broad ones such as "question answering" and "text generation" to specialized ones like "biomedical information retrieval" and "SQL Parsing" [53]. A recent survey that focused on just one task, reading comprehension, profiled 31 datasets and metrics [35], probably a subset of the 71 that Papers with Code identifies.
The extraordinary diversification and proliferation of measures of machine intelligence draws attention to the contrast between the predominant aims of AI research and ongoing discussions of Artificial General Intelligence (AGI). AGI is commonly understood as "human-like" [55] intelligence or a "human-level intelligent system" [56]. AGI invites us to imagine a machine as capable as us, yet this misrepresents the aims of AI research. While it is true that "cross-domain general intelligence" [60] is a key objective, its attainment will not produce a human-like android but a new kind of entity. Though this entity will be capable, in many situations, of convincingly simulating a human, it will possess a different constellation of abilities. If GPT-3 is taken as a progenitor, for example, it incorporates a vastly larger body of knowledge than any person possesses and exhibits super-human abilities on some tasks while being deficient or completely lacking in others.
As a concept, though not a term, AGI can be traced to Alan Turing's 1950 paper on intelligent machines [57]. There, Turing identifies the key question as "Can machines think?", but immediately acknowledges that commonsense understandings of "machine" and "think" will lead to "absurd" responses. He then substitutes a different question: could a machine so convincingly imitate a human that an interrogator could not tell it apart from a real person? His "Imitation Game" became the Turing test. Lost in the mythologization of the Turing test is that it measures how adept a machine is at deceptive communication, not its human-level intelligence per se.
As advances in machine intelligence engender more frequent discussions of the prospects for AGI, we are in danger that an illusory goal and a narrow measure of ability leave us increasingly vulnerable to not understanding the actual capabilities of intelligent machines.
A key example of this is the argument by Bender et al. that "no actual language understanding is taking place" in large language models (LLMs), which are just "stochastic parrots", because generated text "is not grounded in communicative intent, any model of the world, or any model of the reader’s state of mind." It is only "the tendency of human interlocutors to impute meaning where there is none" that induces "the general public into taking synthetic text as meaningful." The "striking gains" posted by LLMs "on various benchmarks intended to model meaning-sensitive tasks" are a "cheat" insofar as the tests were "meant to require language understanding" [58].
This argument is indebted to a long lineage of skepticism about the claims for and possibilities of "strong AI". Two key early examples were John Searle's "Chinese Room" thought experiment [63, 64] and what became known as the "philosophical zombies" critique of Robert Kirk [65, 66]. Zombies, beings "exactly like us in all physical respects but without conscious experiences" [66], were a thought experiment intended to frame what subsequently became known as the "hard problem of consciousness" [67, 68]. Kirk's original intent was to paint physicalists into a corner: either deny the possibility of zombies or acknowledge that physicalism cannot offer a complete account of reality — because if zombies resemble us in all respects, and can act exactly like us, yet lack consciousness, then physicalism does not explain consciousness. As he put it more recently, "if zombies really are possible, then physicalism is false and some kind of dualism is true" [66].
Given that LLMs, and digital computers generally, are obviously not like us in all physical respects, the notion of zombies is relevant only because it encapsulates the prospect of intelligence without awareness. Searle, in contrast, used the Chinese room thought experiment to focus directly on the status of AI, but in a way that relied on a concept of programmability that has been superseded by neural network-based architectures.
The Chinese Room is meant to illustrate the possibility of a computer program appearing (to outside observers) to understand Chinese purely through a mastery of syntax. In brief, imagine that you are a monolingual speaker of English isolated in a room with nothing but a huge book. You have been told how to use this book to look up (based on appearance) any sequence of Chinese characters to find a second "response" sequence. An interlocutor is outside the room, and you must communicate with her using only slips of paper slid under the door. She slides a piece of paper to prompt a response from you, and you search the book for an appropriate response. Ultimately, you find the unfamiliar squiggles that match her squiggles and submit your response ([72]; see also the summary in [64]).
The Chinese Room experiment led Searle to conclude that "the formal symbol manipulations by themselves don't have any intentionality; they are quite meaningless; they aren't even symbol manipulations, since the symbols don't symbolize anything. In the linguistic jargon, they have only a syntax but no semantics. Such intentionality as computers appear to have is solely in the minds of those who program them and those who use them, those who send in the input and those who interpret the output" [63, 70].
Searle's conception of programmability reflected the dominance of the symbolic or "Good Old-Fashioned AI" (GOFAI) of the 1970s [73], based on which he ruled out a computer having consciousness, intentionality, or understanding. Symbolic AI relies on high-level human-readable representations of problems, logical inference, symbolic mathematics, decision trees, and knowledge bases and ontologies to build expert systems, automated planning and scheduling systems, and other applications [104]. The other major AI paradigm, connectionism, based on neural networks as simplified models of the brain, underlies all LLMs [105].
It is easy to understand Searle's confidence in asserting that there is no "there" there with respect to purely symbolic AI. One of the virtues of symbolic AI applications is that all outputs are explainable by reference to ontologies, rules, and the like. The machine still seems to be a mechanism, however complex. But connectionist architectures are different. Despite a long history of theorizing, experimentation, and development [105, 106], the capabilities and limitations of LLMs and other connectionist applications have to be discovered empirically, often surprising and unsettling not just the lay public but experts themselves [80]. Like viruses, LLMs expose the limitations of our categorical understanding. Not only their abilities but their nature is elusive.
The philosophical concept of zombies exposes another way that AI frustrates our categorical understanding by asking what it would mean to exhibit intentionality and understanding without consciousness. Absent a satisfactory test for consciousness there is no way to know whether this is possible. But there is substantial evidence that GPT-3 and other LLMs do encode and utilize semantics. Li et al. found that LLMs "represent meaning in a specific sense: in simple semantic domains, they build representations of situations and entities that encode logical descriptions of each entity’s dynamic state" [69]. Piantadosi and Hill, in a study applying conceptual role theory to LLMs, assert that "the search for meaning in learning models—or brains—should focus on understanding the way that the systems’ internal representational states relate to each other. ... The key question for LLMs is whether training to predict text could actually support discovery of conceptual roles." And they find that it "has been answered in a promising, partial affirmative by studies showing success on tasks that require knowledge of relationships between concepts. Text provides such clues to conceptual role because human conceptual roles generated the text" [70]. And Sobieszek and Price, despite their assessment that a "statistical language generator will not be able to display consistent fidelity to the real world" and their dismissal of distributional semantics, recognize that there is "empirical support for the fact that semantic relationships can be learned from texts alone" and that GPT-3 engages "competently in a variety of language games that do not require an embodied context, as the things that people do in language present themselves as regularities to be learned" [46].
Most importantly, a recent review by Davis and Yee of the two leading theories of semantic memory — embodied cognition and distributional semantic models — shows that computational semantics based on distributional theory can be effective even if limited by a lack of embodied experience [72]. Moreover, Davis and Yee find that there is an underlying structural parallelism between the two theories and that hybrid models perform better than either alone. A key conclusion is that "we can consider distributional and embodied information as fundamentally the same type of data, entangled and mutually influencing each other across multiple timescales."
Davis and Yee trace the evolution of hybrid semantic models and show that when data from embodiment was combined with distributional linguistic data in a single model, the "semantic representations that emerged ... matched human behavior better than if the model was fed either (a) each distribution individually or (b) both distributions independently". More recent research has shown that "equipping a distributional-style model with sensory-perceptual property data may effectively simulate embodied phenomena", and that new concepts (e.g., coffee) can be learned absent direct experience via association of them with existing ones (e.g., tea). These alternatives have been termed "acquired embodiment" [72].
Linguistic learning research also suggests that distributional language contexts "facilitate category learning — so long as the novel words are presented with known words that have coherent semantic organization". For example, "even if you do not know anything about yerba mate, finding out that it is associated with tea, cups, and cafes, can facilitate recognition of sentences containing the novel word."
The role of distributional linguistics in category learning is particularly important with regard to abstract concepts like justice or love as well as with respect to the problem of explaining how, "if our concept representations are built of individual experiences, ... we can communicate at all". First, commonalities in experience are likely enough to enable communication in most cases. Secondly, "if we can acquire category structure through linguistic context alone, this suggests that language usage is also a powerful mechanism through which we gain access to — and assimilate new information into — categories of knowledge that are largely agreed upon within human societies" [72].
In light of extensive evidence of the existence and efficacy of LLM semantics, Bender et al.'s description of LLMs as "stochastic parrots" seems, ironically, to be parroting outdated philosophical critiques of AI rather than attending to contemporary research findings. And their assertion that an LLM may "manipulate linguistic form well enough to cheat its way through tests meant to require language understanding" ignores the fact that most of the tests being employed, including virtually all of the newer ones, were designed to assess the abilities of LLMs and other computational systems. As previously noted, many of the benchmark tests cannot even be administered to humans, and all of them presume that an LLM's score is a valid measure of ability.
The critique of GPT-3 and other LLMs as "mindless machines" [74] portends a kind of denialism with respect to three emergent conflicts over the role and prospects of artificial intelligence:
If you believe LLMs are merely parroting with textual mashups, you are likely to be dismissive of deepening apprehensions that artificial intelligence could pose an existential risk to humanity [75, 76, 77]. The danger that intelligent agents able to "plan actions in an unknown environment" and "learn which actions serve their goal" would circumvent control methods with catastrophic consequences [77] is likely to seem a fallacy fed by hype and "criti-hype" in "corporate marketing materials, academic research papers and in mainstream media" [78].
Similarly, proposals to consider "creating a specific legal status for robots in the long run, so that at least the most sophisticated autonomous robots could be established as having the status of electronic persons responsible for making good any damage they may cause, and possibly applying electronic personality to cases where robots make autonomous decisions or otherwise interact with third parties independently" [79] will seem overblown and misguided.
Lastly, and more insidiously, the emergent ability of LLMs to perform a rapidly growing variety of linguistic and linguistically mediated tasks with increasing competency and sophistication, discussed in more detail in the next section, will be regarded as an "illusion" if not a fraud based on "our predisposition to interpret communicative acts as conveying coherent meaning and intent, whether or not they do", because Bender et al. construe semantics so as to insist that "one side of the communication does not have meaning" [58].
A denialist position, specifically a categorical rejection of the existence of meaning, intention, understanding, or consciousness in machines, is likely to impede constructive adaptation to artificial intelligence and ultimately increase actual danger as well as perceived threats.
Recent research has demonstrated a variety of emergent abilities of LLMs based on scaling one or more of training computation, number of model parameters, and training dataset size,
where emergent abilities are defined as "abilities that are not present in smaller-scale models but are present in large-scale models; thus they cannot be predicted by simply extrapolating the performance improvements on smaller-scale models." Many of these abilities appeared using few-shot prompting; others surfaced with augmented prompting strategies. And there are indications that, once identified, methods can be found to induce many emergent abilities at smaller scales via "new architectures, higher-quality data, or improved training procedures" [80].
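The difference between plain few-shot prompting and one widely discussed augmented strategy, chain-of-thought prompting, can be seen in the shape of the prompts themselves. The prompts below are illustrative examples written for this article, not ones drawn from the cited study.

```python
# Minimal sketch: a plain few-shot prompt versus a chain-of-thought prompt, in which the
# demonstration includes intermediate reasoning steps rather than a bare answer.
few_shot_prompt = """Q: The cafeteria had 23 apples. It used 20 and bought 6 more. How many apples are there?
A: 9

Q: Roger has 5 balls. He buys 2 cans of 3 balls each. How many balls does he have now?
A:"""

chain_of_thought_prompt = """Q: The cafeteria had 23 apples. It used 20 and bought 6 more. How many apples are there?
A: The cafeteria started with 23 apples, used 20, leaving 3, then bought 6 more. 3 + 6 = 9. The answer is 9.

Q: Roger has 5 balls. He buys 2 cans of 3 balls each. How many balls does he have now?
A:"""

print(few_shot_prompt)
print(chain_of_thought_prompt)
```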
These findings show that scale can yield qualitatively new capabilities, that such capabilities cannot be reliably predicted in advance, and that once discovered they may be reproducible at smaller scales.
Beyond this, potential transformational technology improvements include integrating textual LLMs with vision systems, creating systems capable of life-long learning (not just pretraining), and integrating LLMs with reinforcement learning systems for agency and intentionality [81].
Foundation model releases through the end of 2021 [118].
Economic and Social Implications of GPT-3 and other Foundation Models
GPT-3 is a prominent example of a foundation model. Foundation models include not only LLMs but multimodal text-vision systems (DALL-E [83] and CLIP [84]), biomedical research systems (Alphafold [82]) and others. A foundation model is "any model that is trained on broad data (generally using self-supervision at scale) that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks" [85]. The term was coined because these models provide "a common basis from which many task-specific models are built via adaptation" [85]. They are being heralded as future platforms "running 24/7, hosting a whole bunch of applications" [86], and may turn out to be a general-purpose technology capable of driving "waves of transformation and productivity growth due to their pervasiveness, improvement over time, and ability to spawn complementary innovations" [85]. Put differently, "AI is moving into its industrial age" [87]. Oren Etzioni, who heads the Allen Institute for AI, estimates that 80% of AI research effort is now focused on foundation models [87].
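The "adaptation" in this definition typically means taking a broadly pretrained model and fine-tuning it, with a relatively small amount of labeled data, for a narrow downstream task. The sketch below shows the general pattern using the Hugging Face ecosystem and a small public checkpoint; the model name and task are illustrative assumptions, not drawn from the sources cited here.

```python
# Minimal sketch of adapting a pretrained foundation model to a downstream task
# (sentiment classification) by attaching a new classification head and fine-tuning it.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("The new foundation model APIs are remarkably easy to use.", return_tensors="pt")
outputs = model(**inputs)          # the new head is untrained: logits are meaningless until fine-tuned
print(outputs.logits.shape)        # torch.Size([1, 2])
# Fine-tuning on a labeled dataset (e.g., with transformers.Trainer) adapts these weights
# to the downstream task while reusing everything the model learned during pretraining.
```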
GPT-3 and other foundation models rely on self-supervised learning, freeing the training process to use unlabeled data. This constitutes a huge cost savings and an opportunity to employ much larger training datasets. Most also use the transformer architecture, enabling models "to jointly attend to information from different representation subspaces at different positions" in input streams, resulting in dramatic improvements in NLP tasks related to longer documents or other sequential inputs [9]. In addition, the transformer architecture uses "significantly more parallelization", which facilitates scaling training on GPUs (graphics processing units) [9]. Further boosting the scaling potential of foundation models, GPU performance has itself improved by ten times in the past seven years [88].
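At the core of this architecture is scaled dot-product attention, several copies of which run in parallel ("heads") over different learned projections of the same input. The following minimal sketch, using PyTorch, shows a single head; it is a simplification of the mechanism described in the transformer paper cited above.

```python
# Minimal sketch of scaled dot-product attention: each position computes a weighted
# mix of all value vectors, with weights given by the similarity of queries and keys.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # similarity of every position to every other
    weights = F.softmax(scores, dim=-1)             # attention weights sum to 1 per position
    return weights @ V

# Toy example: a batch of one sequence with 5 tokens and 8-dimensional representations.
x = torch.randn(1, 5, 8)
print(scaled_dot_product_attention(x, x, x).shape)  # torch.Size([1, 5, 8])
```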
There are, nevertheless, lingering uncertainties about the prospects for foundation models. In the most extensive review of them, the Center for Research on Foundation Models (CRFM) at Stanford University pointed out that "Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties." Though it may surprise outsiders, the CRFM noted that the AI community "currently has a limited theoretical understanding of foundation models" and relies on empirical studies to detect and measure abilities [85]. Moreover, known issues with foundation models restrict their utility. First, GPT-3 is hardly alone in favoring plausibility over truthfulness. Second, foundation models tend to inherit and even amplify biases in training data, a problem made worse by their core position and the dependence of downstream applications on them. As a reviewer of their potential use in healthcare put it, as long as "many of their pitfalls remain unexplored" they are "unsuitable for immediate deployment in high-stakes areas such as healthcare and biotechnology" [89]. Lastly, the "costs of training, configuring, and running large language models remain prohibitive and are likely to grow in the future", which will "centralize power within the hands of a few wealthy companies by making it even harder for smaller research labs and independent researchers to work on large language models" [90].
Of course, centralization of control is not incompatible with widespread use, and centralization of AI has been critiqued as another instance in which creation of a "digital platform" enables "large tech corporations to position themselves as intermediaries in a network of different actors, allowing them to extract data, harness network effects, and approach monopoly status" [91]. But lack of access for colleges and universities, nonprofits, and public sector agencies threatens to impede and undermine the use of foundation models. "Some models (e.g., GPT-3) are not released at all (only API access to a limited pool of people). Even datasets (e.g., for GPT-2) are not released. While trained models may be available (e.g., BERT), the actual training of foundation models is unavailable to the vast majority of AI researchers, due to the much higher computational cost and the complex engineering requirements" [85]. If access is not improved, adverse results could include:
In arguing the need to revise the EU's proposed Artificial Intelligence Act (AIA) to include foundation models as a new type of system, Wójcik identifies bias as a key risk. The "potential exclusion of foundation models from the scope [of the] AIA remains deeply problematic from the point of view of algorithmic discrimination. It means that general purpose models do not need to comply with any criteria for high-risk systems, including the quality criteria for training, validation and testing of data sets (Art. 10 AIA) and appropriate bias mitigation measures implemented throughout their life cycle (Art. 15(3) AIA)" [89].
There are already numerous applications built on GPT-3 and other foundation models. Among the most prominent and successful is GitHub's Copilot, a code generator built atop OpenAI's Codex, itself a code-specialized version of GPT-3. Codex was trained on "54 million public software repositories hosted on GitHub, containing 179 GB of unique Python files" [92]. After a technical preview period involving 1.2 million developers, Copilot was made generally available on June 21, 2022 [93]. During the year-long preview and since, it has become an indispensable tool: "in files where Copilot is activated, it accounts for around an impressive 40 percent of the written code. Developers and engineers ... say that while there are limits to Copilot's capabilities, there's no denying that it improves their productivity significantly", particularly with respect to "running test cases, setting up web servers, documenting code" and other largely routine but often time-consuming tasks [94]. Copilot also demonstrates the self-reinforcing advantages large tech companies have in developing and marketing foundation model products. Microsoft had the ability "to create a large cloud infrastructure that is specialized for machine learning models such as Codex. It runs inference and provides suggestions in milliseconds. And more importantly, Microsoft is able to run and provide Copilot at a very affordable price. Currently, Copilot is offered at $10/month or $100/year, and it will be provided for free to students and maintainers of popular open-source repositories" [94].
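The comment-to-code workflow that Copilot supports looks roughly like the sketch below: the developer writes a signature and docstring, and the model proposes a body. The completion shown is an illustration written for this article, not an actual Copilot output.

```python
# Illustrative example of the kind of routine completion Copilot is used for.

# Written by the developer:
def median(values: list) -> float:
    """Return the median of a non-empty list of numbers."""
    # The kind of body a code model typically suggests:
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([3, 1, 4, 1, 5]))  # -> 3
```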
There are other code generators available, including CodeGen (from Salesforce), Warp (for generating operating system shell commands), Autoregex.xyz (regular expression generator), Cogram (database queries), and more.
Other areas with released products include
More ambitious offerings are emerging. NVIDIA just announced BioNeMo, "an AI-powered drug discovery cloud service and framework built on NVIDIA NeMo Megatron for training and deploying large biomolecular transformer AI models at supercomputing scale" [95]. It is a domain-specific LLM, "designed to support molecular data represented in the SMILES notation for chemical structures, and in FASTA sequence strings for amino acids and nucleic acids" [96]. Microsoft's parameter-efficient Z-Code++ demonstrated new state-of-the-art abstractive text summarization abilities, bringing closer the prospect of numerous summarization applications in many fields [98, 99]. And on the horizon are potentially transformational impacts in healthcare, law, education, and other areas [85, 89].
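For readers unfamiliar with the two notations mentioned, SMILES encodes chemical structures as character strings and FASTA encodes amino acid or nucleic acid sequences as annotated text records. The snippet below shows standard, well-known examples (caffeine and the human insulin B chain) purely for illustration; it is unrelated to BioNeMo's own tooling.

```python
# Illustrative SMILES and FASTA data, the textual formats that biomolecular language
# models such as the one described above are trained to read.
caffeine_smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"   # caffeine encoded in SMILES notation

insulin_b_chain_fasta = """>human_insulin_B_chain example
FVNQHLCGSHLVEALYLVCGERGFFYTPKT"""

print(len(caffeine_smiles), "characters of SMILES")
print(len(insulin_b_chain_fasta.splitlines()[1]), "amino acid residues")
```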
Wu Dao 2.0
Foundation model development is part of a global AI competition, and hence the economic and social impacts will be felt globally. China's advances are particularly notable, especially the June 2021 announcement, by the Beijing Academy of Artificial Intelligence (BAAI), of Wu Dao 2.0, a second-generation foundation model. Wu Dao — which means Enlightenment — uses 1.75 trillion parameters, the current record and ten times GPT-3's. It was trained on 4.9TB of high-quality text and image data, including 1.2TB of Chinese text, 2.5TB of Chinese graphic data, and 1.2TB of English text. It is more robustly bilingual than other foundation models. It is multimodal, having "the ability to perform natural language processing, text generation, image recognition, and image generation tasks. […] as well as captioning images and creating nearly photorealistic artwork, given natural language descriptions." It incorporates a number of key innovations and has been reported by the BAAI to have reached or surpassed state-of-the-art results on nine widely recognized benchmarks [109, 111].
Table 1. Selected large language models in the U.S. and China [109].
Unfortunately, there is currently a dearth of published papers and test data about Wu Dao accessible outside China. However, the lead researcher, Jie Tang, highlighted in an interview that the Wu Dao team is working on
A remarkable offshoot of Wu Dao and XiaoIce Band, an AI music generation framework, is Hua Zhibing, a virtual student enrolled at Tsinghua University in 2021 [113]. A video about this project includes, in Chinese, a note that the "narration, background music, facial features, poems, and paintings of AI student Hua Zhibing in this demonstration video are all based on the AI generation technology of Wu Dao and the XiaoIce framework" [112].
Conclusions
GPT-3 and other foundation models are likely to have profound and potentially rapid impacts on business, employment, science, and society. The ability to rely on self-supervised training to find manifold patterns in very large datasets, and to apply that learning to generative and inferential tasks, could eventually transform innumerable activities in much the way AlphaFold has transformed structural biology [101], Copilot is changing programming, and DALL-E and other image generators are likely to change art and illustration.
It is all but impossible to predict the specific trajectory of change in any domain, in part because the deployment of such powerful tools immediately precipitates feedback loops that affect how the tools themselves are altered, diversified, and further developed, while simultaneously changing employment patterns and productive capacities in the domain. In little more than a year since the public availability of AlphaFold, for example, "more than half a million researchers have used the machine-learning system, generating thousands of papers" [101]. GPT-3 and other LLMs may well shift writing towards prompting, editing, and collating, as Floridi and Chiriatti suggest in their study of the likely "consequences of the industrialisation of automatic and cheap production of good, semantic artefacts" [2], and as Schatten illustrates in his reflections on the future of college writing [102]. Media coverage of GPT-3 included a widely commented-on article in The Guardian, "A robot wrote this entire article. Are you scared yet, human?" For many, it was an example of the sensationalistic exaggeration of GPT-3's abilities, since the editors stitched together eight different responses by GPT-3 to create it. Yet it demonstrates Floridi and Chiriatti's point that the creation of textual artifacts will shift from writing to intelligent prompting and collating [43, 44, 45].
A similar shift is underway in the visual arts, with the prompting process and the evaluation of outputs becoming the focus of human engagement and creativity. Jason Allen won the Colorado State Fair's fine arts competition in the "digitally manipulated photography" category with a work produced using Midjourney, a DALL-E-like image generator. Allen "spent 80 hours making more than 900 iterations of the art, adding words like 'opulent' and 'lavish' to fine tune its tone and feel. He declined to share the full series of words he used to create his art, saying it is his artistic product, and that he intends to publish it later. 'If there's one thing you can take ownership of, it's your prompt,' he said" [103].
And we are only at the beginning of what foundation models could make possible. The CRFM considered the prospects for foundation models in the three key areas of healthcare, law, and education, and concluded that:
Just over the horizon
As the examples above show, foundation models will transform many professional fields. In some, such as programming, this may initially take the form of augmentation, the generation of code snippets that makes programmers more productive. But in others, notably the visual arts, it may soon replace much human production. Some artists are already alarmed that DALL-E and other image generators are being used to create thousands of images mirroring their styles without credit or compensation [115]. And further advances may enable AI to substitute for humans even at the pinnacles of technical and creative professions, as Gary Grossman notes using the example of the semiconductor chip architect: "This is a highly sophisticated profession, an advanced application of electrical engineering in arguably one of the most complex industries. If ever there was a job that might be thought of as immune from AI, this would have been a strong candidate. Yet recent advances from Google and Synopsys (among others using reinforcement learning neural network software) have shown the ability to do in hours what often required a team of engineers months to achieve" [116]. More generally, what "has been the fate of unskilled lower-wage workers in recent decades may turn out to be the fate of high-skilled and high-wage workers in future decades" [117].
The virtual student Hua Zhibing foreshadows a time when the successors of today's foundation models become capable of life-long learning and goal-directed behavior. When this occurs, potentially in as little as ten years, virtual agents are likely to be experienced as sentient beings with a persuasive immediacy largely immune to counterargument. Despite existing as integral parts of complex networked systems, virtual agents that assume highly valued relational and occupational positions will probably be treated as human-like, yet may not be accorded the legal status of personhood or citizenship rights. This threshold could be a highly destabilizing moment for human societies.
Similarly, there is a danger that AI-enhanced disinformation campaigns will not only facilitate scaled-up operations but utilize more fluent, sophisticated, and seamless human interactions. Yet however serious a threat organized disinformation poses, the larger reality is that virtual agents capable of increasingly sophisticated and thoroughgoing human behavior could be subject to hidden control for a variety of purposes, including commercial, religious, political, cultural, criminal, and other concealed ends.
The advancement of AI may widen a chasm between a minority of highly educated and favorably positioned professionals and a growing body of people threatened with labor redundancy. And the diffusion of virtual agents may create social divisions between those subject to virtual management and control, or to manipulation and exploitation, on the one hand, and, on the other, those able to employ, direct, or simply interact with such agents in productive and nourishing ways.
Efforts to prevent, constructively regulate, or remediate these impending dangers are likely to require government involvement. Unfortunately, there are two reasons to be deeply concerned about the prospects for effective and timely governmental intervention. First, this must be addressed globally, yet China and the United States, the two countries most responsible for the rapid advance of AI, are increasingly in conflict and unlikely to construct a governance regime for AI. Second, and in contrast to accelerating AI development, even the most capable governments are slow to grasp the seriousness of these issues. The time to undertake urgent regulatory planning is when there is still an air of incredulity. But instead, outright denial and seemingly rational skepticism delay and impede any governmental response.
References