Do language models memorize?

The New York Times (NYT) has joined the list of plaintiffs complaining that large language models, particularly the models from OpenAI, have been stealing copyrighted material. To understand these complaints, it may be helpful to break them down into simpler legal and technical questions, taken roughly in the order in which they arise in the process.

1. Did OpenAI access NYT articles to build its language model? (technical question)

2. Is this access (if it occurred) a fair use of those articles? (legal question)

3. Did OpenAI copy the articles into its models? (technical question)

4. Does it matter whether the models generate content similar to their training data versus retrieve that content? (legal question)

Background

Large language models are constructed of layered neural networks. Each layer consists of metaphorical neurons that accept input values and produce an output value that depends on those inputs. Each neuron sums the inputs that it receives from the previous layer and passes an output value to the next layer. The degree to which each neuron affects each neuron in the next layer is controlled by a weight parameter. GPT-4 is said to have about 1.76 trillion parameters.
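To make the idea of a weighted layer concrete, here is a minimal Python sketch. It is illustrative only (it is not OpenAI's code, and the layer sizes and the ReLU-style activation are assumptions): each neuron sums its weighted inputs and passes an output on to the next layer.

```python
import numpy as np

def layer_forward(inputs, weights, biases):
    """One metaphorical layer: each neuron sums its weighted inputs
    (plus a bias) and applies a simple nonlinearity to produce its output."""
    summed = weights @ inputs + biases   # weighted sum per neuron
    return np.maximum(summed, 0.0)       # ReLU-style activation (an assumption)

# Toy example: 4 inputs from the previous layer feeding 3 neurons.
rng = np.random.default_rng(0)
x = rng.normal(size=4)                   # outputs of the previous layer
W = rng.normal(size=(3, 4))              # one weight per input/neuron pair
b = np.zeros(3)
print(layer_forward(x, W, b))            # outputs passed to the next layer
```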

The models are trained by exposing them to large amounts of text. The training process adjusts the parameters to improve the accuracy of the model’s word (technically “token”) predictions. Given a text (called a “context” or, in production, a “prompt”), the model predicts the word/token that follows. The predicted word is added to the context and the process is repeated. Training continues until the predictions are accurate.

For example:

Context → Prediction

This is the way the → sentence

This is the way the sentence → ends

This is the way the sentence ends → ,

This is the way the sentence ends, → not

Each row shows a context followed by its prediction. The predicted token is added to the context and the next token is predicted.
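The context/prediction pairs in the table above can be generated mechanically from any text. The sketch below is only an illustration: it splits on whitespace, whereas real models operate on learned sub-word tokens.

```python
text = "This is the way the sentence ends , not"
tokens = text.split()  # real models use learned sub-word tokens, not whitespace words

# Each training example pairs a context with the token that follows it.
for i in range(5, len(tokens)):
    context, prediction = tokens[:i], tokens[i]
    print(" ".join(context), "->", prediction)
```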

GPT-4 is said to have been trained on a text set of about 13 trillion tokens. After this training, the models were further trained with human feedback to increase the probability that they would produce socially acceptable outputs. This human feedback adjusts some of the parameters but not the structure or operation of the model.

Once the model is trained, it can then be used to generate text. The user provides some contextual text, now called a “prompt,” and the model generates the next token according to the parameter values that it has learned. The generated token is added to the prompt and the model generates the next token, and so on.
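Here is a rough sketch of that generate-append-repeat loop. The fixed toy distribution below is a made-up stand-in for a real model, which would condition its probabilities on the entire preceding context.

```python
import random

def generate(model_probs, prompt_tokens, num_new_tokens):
    """Autoregressive generation: sample a next token from the model's
    conditional distribution, append it to the context, and repeat."""
    tokens = list(prompt_tokens)
    for _ in range(num_new_tokens):
        probs = model_probs(tokens)                     # {token: probability} given the context
        candidates, weights = zip(*probs.items())
        next_token = random.choices(candidates, weights=weights)[0]
        tokens.append(next_token)
    return tokens

# Stand-in "model": a fixed distribution regardless of context (a real model
# would recompute these probabilities from the whole context at every step).
toy_model = lambda ctx: {"the": 0.4, "sentence": 0.3, "ends": 0.2, ".": 0.1}
print(generate(toy_model, ["This", "is"], 5))
```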

The parameters of the model, in aggregate, represent the text on which the model has been trained. 1.76 trillion parameters sounds like a lot, but compared to the training set, it is only a tiny fraction of the number of parameters that would be needed to represent the training text exactly. ChatGPT (GPT-3.5) had a maximum context length of 4,096 tokens and a vocabulary of roughly 100,000 tokens. The combination of roughly 100,000 possible tokens in each of 4,096 positions in the context would require a number of parameters (about 10^20,480) many times greater than the number of atoms estimated to exist in the universe (about 10^82). Instead, multiple tokens share parameters and each parameter contributes to many tokens. There is no representation of individual texts. Each article gets smooshed in with every other article. Each token sequence affects a large number of parameters and overlaps with many other sequences of tokens. In short, there is no set of neurons or parameters that represents a specific text individually and that could, for example, be removed from the model as if the text had never been used in training.
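The arithmetic behind that comparison can be checked in a few lines. The vocabulary and context sizes below are the approximate figures discussed above, not exact specifications.

```python
import math

vocab_size = 100_000      # approximate token vocabulary (an assumption)
context_length = 4096     # maximum context length discussed above

# Number of distinct possible contexts = vocab_size ** context_length.
# Work in log10 because the integer itself is astronomically large.
log10_contexts = context_length * math.log10(vocab_size)
print(f"distinct contexts ~ 10^{log10_contexts:,.0f}")  # ~ 10^20,480
print("atoms in the observable universe ~ 10^82")
```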

Did OpenAI access NYT articles to build its language model?

I do not know whether OpenAI did or did not access the articles that they are accused of copying from the NYT. It is certainly possible, but it is also possible that they accessed only secondary sources, perhaps sources quoting from the NYT or following up on its reporting. The observation that the models produced text similar to that originally written by the NYT does not prove that the source was the NYT. Alternative explanations must be considered.

The observation that the model generates an output that is similar to some identified source text does not guarantee that the text came from that source. For example, the complaint cites an article that starts “GUY FIERI, have you eaten at your new restaurant in Times Square?” (p. 36). That same text appears elsewhere on the Web, so the NYT was not necessarily its source.

Is this access (if it occurred) a fair use of those articles?

This is a legal question that will need to be addressed by the courts. The Fair Use Doctrine permits “the unlicensed use of copyright-protected works in certain circumstances. Section 107 of the Copyright Act provides the statutory framework for determining whether something is a fair use and identifies certain types of uses—such as criticism, comment, news reporting, teaching, scholarship, and research—as examples of activities that may qualify as fair use.”

Did OpenAI copy the articles into its models?

The NYT complaint claims that the large language model “memorized” NYT content. According to the Cambridge online dictionary, “to memorize” means “to learn something so that you will remember it exactly.” The NYT complaint, though, defines “memorization” differently, following van den Berg and Williams (2021) to mean “memorization that arises as an increased probability of generating a sample that closely resembles the training data.” The NYT complaint conflates these two definitions of memorization.

On page 29 of the NYT complaint, they say: “As further evidence of being trained using unauthorized copies of Times Works, the GPT LLMs themselves have ‘memorized’ copies of many of those same works encoded into their parameters.” They continue on page 30: “Such memorized examples constitute unauthorized copies or derivative works of the Times Works used to train the model.” Their evidence of the model having memorized is the observation that under certain circumstances, the model will “output near-verbatim copies of significant portions of Times Works when prompted to do so.”

From the background section, it should be clear that the language models do not copy text in any literal sense. There are not enough parameters in the universe to literally copy every training text into their neural networks. Let’s consider what it means to produce “near-verbatim” text, or text that “closely resembles” the training data, and how that could happen without copying the training text explicitly.

The “infinite monkey theorem” says that an infinite number of monkeys typing on an infinite number of typewriters will eventually produce any text that could be written using the characters on those typewriters, including the works of Shakespeare. In short, each sequence of letters has a certain probability of occurrence, including the sequence of letters that makes up Hamlet. Each letter has a probability of being typed, as does each combination of letters, each token, and so on.

The infinite monkey theorem can be extended to an infinite stochastic parrot theorem. The language models learn the probabilities of each token relative to other tokens. Each token with a nonzero probability of being produced will eventually be produced. The difference between the monkey and the parrot theorems is that the monkey theorem assumes that all letters have an equal probability of being typed. The parrot theorem recognizes that the tokens differ in probability, but both assume that any token (or string of tokens) with a nonzero probability will eventually be produced.

When a model produces text, it selects each additional token based on these probability distributions, conditional on the context. Eventually, the model will produce each of the texts on which it has been trained, if only by chance (the stochastic part). Therefore, it is not surprising that models sometimes reproduce the original text on which they were trained.

The probability of the model producing approximately the original text depends on the rarity of the text being mimicked and the prompt used to elicit it. Van den Berg and Williams (2021) go on to say: “… we are concerned with memorization that arises as an increased probability of generating a sample that closely resembles the training data in regions of the input space where the algorithm has not seen sufficient observations to enable generalization. For example, we may expect that highly memorized observations are either in some way atypical or are essential for properly modeling a particular region of the data manifold.” In other words, they predict that a model will produce text closer to the input patterns when those input patterns are relatively unique. The less they overlap with other text patterns, the more likely they are to be quoted nearly verbatim, because the probability distribution will closely match the rare text. Additionally, specific and unique prompts further constrain the produced output patterns because there are fewer continuations that are consistent (that have non-zero probabilities) with those prompts. It is also well known that small changes to a prompt can lead to the production of very different texts.
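One way to see why rarity matters: the probability of reproducing an exact token sequence is the product of the per-token conditional probabilities, so it stays high only when each step leaves the model few plausible alternatives. The numbers below are hypothetical, chosen only to illustrate the arithmetic.

```python
import math

def sequence_probability(conditional_probs):
    """Probability of producing an exact token sequence is the product of the
    per-token conditional probabilities P(token | preceding context)."""
    return math.prod(conditional_probs)

# Hypothetical per-token probabilities for a 50-token continuation.
# A distinctive prompt leaves few plausible continuations, so each next-token
# probability is close to 1 and the product stays relatively large.
rare_text = [0.95] * 50
# A generic prompt spreads probability across many alternatives at every step,
# so the product shrinks geometrically.
generic_text = [0.30] * 50

print(f"rare/unique text: {sequence_probability(rare_text):.3g}")    # ~0.077
print(f"generic text:     {sequence_probability(generic_text):.3g}") # ~7e-27
```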

Does it matter whether the models generate content similar to their training data versus retrieve that content?

The courts will need to decide whether the occasional occurrence of verbatim or nearly verbatim text constitutes copyright infringement. I expect that it will depend on a large number of factors, including the uniqueness of the text, the uniqueness of the prompt relative to the text, and the likelihood that an uninformed user would spontaneously provide the necessary prompt(s), among others. Given the stochastic nature of language models, it may also depend on the probability of producing a nearly verbatim response.

Conclusion

My goal in writing this essay is not to refute the New York Times complaint; rather, it is to try to reduce the hype and the hysteria and to promote reasonable discourse about artificial intelligence. Language models are not people in metal boxes.

Language models are conceptually very simple. They are trained to predict the next word given a context of previous words. They can be powerful tools, but they are nowhere near as powerful as some people hope and others fear.

Large language models do not copy the text they are trained on. They learn to estimate the probabilities of each word given the preceding context. There is not a one-to-one correspondence between the input contexts and their representations. There is a many-to-many relationship, where each context is represented by values shared by many of the figurative neurons and where the neurons (parameters) represent many texts. Indeed, the input context is represented as a relatively small vector (an ordered list of numbers), called an embedding. Up to thousands of tokens in a context are represented by about a thousand numbers. Because words with similar meanings tend to occur in similar contexts, vectors representing similar texts tend also to be similar. This pattern allows the language model to generalize and produce similar responses to similar input contexts. But when there are few texts similar to a given context, the probabilities must be heavily weighted toward the one text the model has been exposed to. Unique texts are therefore more likely than generic texts to be reproduced close to verbatim.

[Figure: Example embedding for three sentences]
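As a toy illustration of that idea, the sketch below compares made-up embedding vectors using cosine similarity. The sentences, the four-dimensional vectors, and their values are all invented for illustration; real embeddings have roughly a thousand dimensions or more.

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity of two embedding vectors: values near 1.0 mean nearly the same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 4-dimensional "embeddings" standing in for much larger real ones.
cat_sat   = np.array([0.9, 0.1, 0.3, 0.0])   # "The cat sat on the mat"
dog_slept = np.array([0.8, 0.2, 0.4, 0.1])   # "The dog slept on the rug"
tax_law   = np.array([0.0, 0.9, 0.0, 0.8])   # "Quarterly tax filings are due"

print(cosine_similarity(cat_sat, dog_slept))  # high: similar contexts, similar vectors
print(cosine_similarity(cat_sat, tax_law))    # low: unrelated contexts
```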

The general principle is that language models have only one mechanism or strategy for producing texts. Based on their aggregated training data, they produce the tokens that are likely in response to the prompts they have been presented. This same strategy must explain the situations where they produce verbatim text, where they produce non-verbatim text, and where they “hallucinate” and produce text that does not correspond to reality. There is not one model for when they produce verbatim text, another model for when they produce fiction, and another model for when they produce accurate text. Any claims that rest on the idea that a model somehow treats different classes of text differently will be inaccurate.

Similarly, claims that a text can only have been retrieved from a specific source are difficult to support. The same text may be published in multiple places on the Web. People share, which itself may raise copyright issues, but sharing is nonetheless common. Even text protected behind a paywall, even whole books, may be obtainable from multiple sources, some of which may be included in the training set. Verbatim reproduction of a text does not guarantee that it was obtained from a specific source.

Based on this analysis, there are two main questions that the courts will need to resolve. Is it fair use to access copyrighted data for the purpose of training a language model? And is it reasonable to accept that a model may sometimes generate verbatim copies of some text? How the courts decide these two questions will have a profound impact on the future of artificial intelligence, so they must be answered carefully, based on sound facts and reasoning.

Comments

Karen Brenchley

Director of Innovation & Product Management | Expert in AI, ML, & Data Analytics | Pioneer in Legal eDiscovery & Data Innovation


Very good explanation. Thanks for covering these points.

John Tredennick

CEO and Founder at Merlin Search Technologies


Brilliantly simple article. You are the best at this. Thanks.

Roumen Popov

DSP Software Engineer


When we consider the uniqueness of a piece of text: in principle it is indeed possible to generate any piece of text by pure chance, but as the length of that text grows, the actual probability of generating it by chance very quickly falls to practically zero. With a vocabulary of just a thousand words and a text length of a mere 30 words, the number of possible word combinations exceeds the number of atoms in the observable universe. Replicating that text by pure chance is practically impossible. This means that even a fairly small context identifies almost exactly its corresponding piece of text. Therefore it follows that if a sufficiently large context/prompt identifying a given piece of text is given to the LLM, it should with very high probability output that piece of text (provided the text was in the training set and the model was trained well). Also, text is very compressible (zip easily achieves a 10:1 ratio), so an LLM would need far fewer parameters to encode its training data than the size of the data itself. How the data is encoded in the weights is probably immaterial from the point of view of copyright; e.g., JPEG encodes an image in the coefficients of a cosine transform, but it is still considered a copy.

Robb Olsen

Duke Professor * Science of Innovation * Board Member Alternative Packaging Solutions * Founder TransOrbital Dynamics * Past P&G Global Products Research Leader


Outstanding essay, Herbert Roitblat, a significant contribution to the public discourse on LLMs in that the technical parameters of model operation are easily understood, and the questions well parsed as to technical or legal in nature. Thank you for sharing.

Bob Roitblat

Illuminating your path to innovative thinking, a future-proof mindset, and leadership prowess. | An international speaker & consultant. | TED Speaker | TV Villain


Thanks for clarifying the questions that must be addressed, and for bringing clarity to how LLMs operate.

