Are GenAI developers running out of data to train LLM models?

This morning, I came across the New York Times article "The Data That Powers AI is Disappearing Fast" (sorry, it is behind a paywall), and I wanted to share some of the perspectives that I have gained from both the article and my own work on AI.

According to the Data Provenance Initiative, the amount of content available to the collections used to build artificial intelligence has dropped dramatically. Until now, AI model developers have built large language models (LLMs) using publicly available internet data, including text, images, and videos.

However, according to the New York Times article, "Over the past year, many of the most important web sources used for training A.I. models have restricted the use of their data, according to a study published this week by the Data Provenance Initiative, an M.I.T.-led research group." The study looked at 14,000 web domains included in three commonly used AI training data sets and found that publishers and online platforms have taken steps to prevent their data from being harvested.

According to the researchers, across the three data sets (C4, RefinedWeb, and Dolma), 5 percent of all data and 25 percent of data from the highest-quality sources have been restricted. Those restrictions have been set through the Robots Exclusion Protocol, which allows website owners to tell automated bots not to crawl their pages using a file called robots.txt. The study also found that as much as 45 percent of the data in one set, C4, had been restricted by websites’ terms of service.
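To make the robots.txt mechanism concrete, here is a minimal sketch of how a well-behaved crawler could check those rules before fetching a page. It uses Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs, the "ResearchCrawler" user agent, and the is_allowed helper are all hypothetical and only for illustration; GPTBot is the user agent name OpenAI publishes for its crawler. Keep in mind that the protocol is a request, not an enforcement mechanism: it only works when the crawler chooses to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve. The rules are
# illustrative and not taken from any real site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as an iterable of lines, so no
    # network request is needed for this sketch.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    article = "https://example.com/articles/story.html"
    draft = "https://example.com/private/draft.html"
    for agent, url in [
        ("GPTBot", article),
        ("ResearchCrawler", article),
        ("ResearchCrawler", draft),
    ]:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

In this sketch the publisher blocks one named AI crawler from the whole site while still letting other bots read everything outside /private/. That is the kind of selective restriction the study counts as data becoming unavailable for training.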

The study shows an alarming decline in consent to use data across the web, which will have ramifications not just for AI companies but also for researchers, academics, and non-commercial entities.

Data is the fuel for AI developers, and tools such as OpenAI's ChatGPT, Google's Gemini, and Anthropic's Claude rely on publicly available data to provide the functionality to write, code, and generate images and videos. The more high-quality data there is, the better the output from the models will be. This echoes what I learned during my 20+ years in business intelligence/analytics and data warehousing: garbage in, garbage out. The same applies to AI; if you feed a model garbage, you should not expect to get quality out of it.

Some AI companies have negotiated deals with publishers such as The Associated Press and News Corp (owner of the Wall Street Journal) to gain access to their content. Other publishers, such as the New York Times, have sued OpenAI and Microsoft for copyright infringement, alleging that their news articles were used to train models without permission.

Training large language models is extremely expensive and requires a business that can afford to invest millions upon millions of dollars in model training. Most organizations should not even consider training foundation models. The trend is to build domain-specific models that require much less data and can be linked to the massive LLMs offered as services by large providers (such as OpenAI, Microsoft AI, Google Gemini, AWS AI, Meta AI, and many others).

It is clear that smaller AI developer organizations relying on public data sets could run into business model issues, as they can't afford to license data from publishers. According to the New York Times article, Common Crawl, one such data set, comprises billions of pages of web content, is maintained by a nonprofit, and has been cited in more than 10,000 academic studies.

Furthermore, it is not clear which popular AI products have been trained on these sources, as most of them do not disclose the full list of data they use. However, data sets derived from Common Crawl, including C4 (which stands for Colossal Clean Crawled Corpus), have been used by companies, including Google and OpenAI, to train previous versions of their models. The study concludes that new tools are needed to give website owners more precise ways to control the use of their data.

As an author of two business books, I understand the concerns that publishers and artists have about the use of their data. We have put hundreds, even thousands, of hours into our work. If this work is then included in LLMs without our consent or without payment, it is upsetting, especially for the authors and artists whose livelihood depends on their work.

Entrepreneurs and organizations using public data should reflect on potential changes in the rules of engagement for internet data use. If their sole business model is based on GenAI built on publicly available data, they could run into issues. Companies building their solutions on GenAI and consuming LLM services from platform vendors like Microsoft should not need to worry. Microsoft has agreed to protect them from any potential infringement lawsuits. However, in these cases

The GenAI story is very similar to what happened in the music industry some years ago. Consumers thought that they could download music for free and use it without limits, but this changed quickly, and new vendors such as Spotify appeared, where the consumer pays Spotify and Spotify pays the artists. I expect this to become the norm within the AI industry as well. Nothing is free. My dad told me this when I was young, and that statement still holds. Somebody will always pay for the "free stuff."

Let me know how you think the AI industry will evolve and whether you and your work are already impacted by it. I would love to hear your thoughts.

Yours,

Dr. Petri I. Salonen

PS. If you would like to get my Business Models in the AI Era newsletter in your inbox on a weekly or bi-weekly basis, you can subscribe to it here on LinkedIn: https://www.dhirubhai.net/newsletters/business-models-in-the-ai-era-7165724425013673985/

