Training data
"Paris Library" by OpenAI's DALL-E 3

Training large language models demands access to vast and diverse textual resources. Pundits have frequently opined that we will "run out" of such data. But computational research into human language has always been constrained by access to digital copies of our texts, so the worry about running out is raised again and again and then overcome. The value of having more data available to computation creates an economic incentive to bring more of it online, whether by digitizing more offline information or by opening up digital collections currently held by various commercial entities.

My personal journey in this area began with the ARTFL Project, established in 1982 through a collaboration between the French government and the University of Chicago to create a large digitized collection of French literature. In the late 1980s, when I was at the University of Chicago, this massive (for that time) database of linguistic data was the core tool we had for exploring new approaches to natural language processing. I was working with Professor Scott Deerwester, one of the authors of Latent Semantic Indexing (LSI), one of the first successful approaches to extracting conceptual content from text by establishing associations between terms that occur in similar contexts.
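
For readers who have never seen LSI in action, here is a minimal sketch of the idea, using scikit-learn's TruncatedSVD and a few toy documents of my own invention (they are illustrative assumptions, not part of the original ARTFL or LSI work): build a term-document matrix and factor it so that texts using terms in similar contexts end up near each other in a low-dimensional "concept" space.

```python
# Minimal sketch of Latent Semantic Indexing (illustrative only):
# factor a term-document matrix with a truncated SVD so that documents
# sharing vocabulary/context land near each other in concept space.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the library digitized French literature",
    "scholars search the digitized French texts",
    "the model learns language structure from text",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)              # documents x terms matrix

svd = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = svd.fit_transform(X)              # documents projected into concept space

# Documents that occur in similar term contexts score as similar
# even when they do not share every literal word.
print(cosine_similarity(doc_vectors))
```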

The commitment by the French government to invest in digitizing French literature exemplifies the kind of foundational resources necessary for the ongoing development and refinement of our machine intelligence systems. Computational research into the structure of language started in the 1950s, and what we have learned over the past 70 years, in particular with the success of generative pre-trained transformers (GPT), is that an underlying "structure" for language can be computed (you can read more on this in a terrific, though long, essay by Stephen Wolfram: What Is ChatGPT Doing and Why Does It Work).

And so we arrive at the commercial conflict that exists today between individual owners of potential training data (an author, or a commercial entity like the New York Times) and the companies that seek to train their large language models on this data. There is no doubt that the data is of value, as the training would not be possible without it, but determining the specific value of any one element of data, and what compensation is merited, becomes complicated very quickly.

Let's explore this issue from a "bottoms up" and a "top down" perspective...

Bottoms Up

What I mean by "bottoms up" is to take a specific piece of information and try to follow the chain up to understand how that one data element might be valued. As a thought experiment, let's consider one of the most influential and important contributions the New York Times has made in its history: the decision (at enormous risk) in 1971 to publish the Pentagon Papers. These government documents, commissioned in 1967 by Robert McNamara, President Lyndon Johnson's secretary of defense, were intended to inform government decision makers on matters related to the US war in Vietnam. The New York Times obtained copies of roughly 7,000 pages from this collection from an employee of the RAND Corporation, Daniel Ellsberg. The Times published its first article on the topic on June 13, 1971 and continued (after winning a case in the US Supreme Court to allow publication) until the final installment on July 5, 1971.

By any measure the investment made by the New York Times created enormous value for the paper and for society. Our ability to function as a democracy depends on our ability to hold elected and unelected officials responsible for their decisions, and this can only happen when we have the truth about those decisions. So what is the New York Times "owed" if its articles about the Pentagon Papers are used in training a language model?

Unpacking this is problematic. First of all, the 7,000 pages of government documents on which the reporting was based were not the property of the New York Times. As a newspaper, the Times reports on things in the real world, and those things, while they may be discovered by the reporting staff, exist independently of any intellectual property the Times might create. Then there are the specific words crafted by a journalist and published in the paper; these are certainly "owned" by the New York Times Company. But how different are those words from the ones that every other news outlet in the country subsequently published on the same topic? And finally, there are the millions of people who, in the 50 years since the 1971 publication of these articles, have written something that quotes some portion of them. In this stew of words, what is the unique value of the ones which appeared in print in the paper that summer? I would argue that the unique value of the articles existed only at the time of publication and began to diminish immediately thereafter. There are certainly types of information other than news which have a longer shelf life, but their value will also diminish over time as others explore the same topics and issues.

Top Down

By "top down" I mean to explore this issue from the perspective of the total training data set -- that is, what any single component of that data set might be worth when considered as part of the whole. The largest current commercial model, GPT-4, is said to have been trained on roughly 10 trillion words. While OpenAI hasn't made any public statements about GPT-5, a number of analysts believe it could be as much as twice the size of GPT-4, or perhaps 20 trillion words. An average US daily newspaper like the New York Times might contain 200,000 words. Thus the GPT-5 training data would consist of one hundred million days of New York Times data. Since its founding on September 18, 1851 the New York Times has published something less than 63,000 editions. Thus if this (VERY) rough calculation was to indicate the total contribution of words from the New York Times into a 20 trillion word training data set, it might total .0006 of that total. That is a generous calculation given that early newspaper editions were certainly smaller than 200,000 words and also that many of the words published were not owned by the New York Times Company - for example, they published full versions of many of the 7000 Pentagon Papers documents.

Calculating Value

Sam Altman has publicly stated, with respect to the current lawsuit with the New York Times, that "we actually don't need to train on their data." The lawsuit itself seems to depend upon somewhat clumsy attempts by Times lawyers to trick ChatGPT into reproducing the full text of articles, and thereby to argue that including their content in the training data creates a competitive alternative to their own publication. But from both a bottoms up and a top down perspective we can see that the novel value of any one piece of content, or even an entire publication history dating back to 1851, is minuscule in the scope of the growing corpus of digitized information, and minuscule again when considered in the context of the average use case for the LLM.

Altman's pronouncement is an interesting economic test of value: is the value of the whole (GPT-5, for example) impacted at all by the exclusion of any one part (all of the New York Times, for example)? If the answer is clearly "no," then perhaps the value of that part is effectively zero. Only when you can say that the value anticipated by a user of the whole (GPT-5) depends in some way on that part (for example, the desire to read a specific New York Times article) would you attribute a value to it. Otherwise it is just a series of words, perhaps as in Borges' Library of Babel, a series which "...includes all verbal structures, all variations permitted by the twenty-five orthographical symbols, but not a single example of absolute nonsense....", but nonetheless just a series of words that make up one extremely small part of computing the underlying logic of human communication.
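
One way to make that test concrete is a leave-one-out comparison: evaluate the model with and without a given source and ask whether anything a user cares about changes. The sketch below is purely illustrative of that idea; the source names, word counts, and the saturating "benchmark" function are made-up assumptions, not a real OpenAI or New York Times measurement.

```python
# Illustrative leave-one-out test of Altman's claim (toy numbers only):
# if removing a source barely changes what users can do with the model,
# its marginal value to the whole is effectively zero.
def evaluate(corpus_sources: set) -> float:
    """Hypothetical benchmark score for a model trained on these sources.
    Toy stand-in: score grows with corpus size, with heavy diminishing returns."""
    # assumed, made-up word counts (billions of words) per source
    words = {"common_crawl": 15_000, "books": 3_000, "code": 1_500, "nyt_archive": 12.6}
    total = sum(words[s] for s in corpus_sources)
    return 100 * (1 - 1 / (1 + total / 1_000))   # saturating toy curve

full = {"common_crawl", "books", "code", "nyt_archive"}
without_nyt = full - {"nyt_archive"}

delta = evaluate(full) - evaluate(without_nyt)
print(f"Toy marginal contribution of the NYT archive: {delta:.4f} benchmark points")
```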


Ajay Mathew

Senior Client Partner | Travel, Transportation and Hospitality

9 months

Calculating the volume of content in articles published by NYT, as a percentage of the total corpus of written human knowledge, may not be an appropriate way to determine the value of NYT content to an LLM like GPT. As a creator, publisher and owner of written content, I am free to set a price for making this content available (e.g., to someone who wants to use it for training an LLM). My content being only a minuscule part of the total LLM training data is irrelevant. There is no incentive for me to price this access at a fraction of a cent... I would rather just not provide that access. At a bare minimum, I would expect to be paid the same amount that would be paid by a human to access this copyrighted work. If we extrapolate this for all copyrighted works, then much of that content becomes unavailable (or cost prohibitive) for training an LLM. This Economist article by Ben Sobel is an interesting and relevant read - https://www.economist.com/by-invitation/2024/02/16/dont-give-ai-free-access-to-work-denied-to-humans-argues-a-legal-scholar


I'm trying to imagine a royalty model akin to that of Spotify, etc., but your article clearly shows the complexity, if not impossibility, of that type of model here. If the entire NYT canon is only about 0.06 percent of the word stew that comprises GPT-5's diet, then the NYT's content is essentially irrelevant to GPT's brain. Yet the request for the full text of a specific article complicates matters exponentially, because you can get that too. At a gut level, I have qualms about letting the LLMs have free use of all written material on the planet, in every language, without the creators of those materials receiving some recompense. One place to start, methinks, is to look at Explainable AI, one of my favorite capabilities of any artificial intelligence; the algorithms can tell us what they're doing if we ask them. Yet the complexity of determining the percentage of NYT content in any GPT response to a request and calculating the financial value of that tidbit seems to make the whole royalty concept absurd. We live in interesting times! Thanks for your ongoing investigations, Ted

Francis Carden

CEO, Founder, Automation Den | Analysis.Tech | Analyst | Keynote Speaker | Thought Leader | LOWCODE | NOCODE | GenAi | Godfather of RPA | Inventor of Neuronomous| UX Guru | Investor | Podcaster

9 months

It's a bit akin to detecting plagiarism by humans of humans. Pretty much impossible when not blatant! Look at just some of the many failed lawsuits on music / songs over the years. Perhaps AI detection on "works of art" can self donate when it detects backwards in time (how far) to the original thought/source!! Hmmm. How long should news content be protected? 1 month? 1000 years?

John Sviokla

Executive Fellow @ Harvard Business School | D.B.A., GAI Insights Co-Founder

10 months

Excellent article, Ted, on laying out some of the complexities. Of course the vast corpus of language is a public good, as is most of the teaching of language. You raise the $64,000 question (should I have to pay for that reference?;) -- what's fair? I believe that the current legal primitives of copyright, fair use, etc., are insufficient for a world in which the corpus of language can be at least partially computed. Put another way, no legal system (or any system for that matter) can outgrow its primitives -- and I think we need new legal primitives like: creation, access, recombination, etc. Just as with land, where we have title to the land itself, mineral rights, rights of way, etc., we need new legal concepts as primitives for new contracts to begin to create prices and contracts for access, recombination, etc., etc. I don't have an answer, but for my money, we need more fundamental questions...

Lori Sherer

Bain & Company San Francisco, Technology, Healthcare, Data Science & GenerativeAi

10 months

Great perspective Ted - as always

