Are GenAI developers running out of data to train LLM models?

Dr. Petri I. Salonen

AI Transformation, Business Modeling, Software Pricing/Packaging, and Advisory. Published author with a strong software business background. Providing interim management roles in the software/IT field

Published: July 20, 2024

This morning, I came across the New York Times article "The Data That Powers AI is Disappearing Fast" (sorry, but behind a paywall), and I wanted to share some of the perspectives that I have seen and learned from both the article and my work on AI.

According to the Data Provenance Initiative, the amount of content available to the data collections used to build artificial intelligence has dropped dramatically. AI model developers have been building large language models (LLMs) using publicly available internet data, including text, images, and videos.

However, according to the New York Times article, "Over the past year, many of the most important web sources used for training A.I. models have restricted the use of their data, according to a study published this week by the Data Provenance Initiative, an M.I.T.-led research group." The study looked at 14,000 web domains included in three commonly used AI training data sets and found that publishers and online platforms have taken steps to prevent their data from being harvested.

According to the researchers, across the three data sets (C4, RefinedWeb, and Dolma), 5 percent of all data and 25 percent of data from the highest-quality sources have been restricted. Those restrictions are set through the Robots Exclusion Protocol, which lets website owners tell automated bots not to crawl their pages using a file called robots.txt. The study also found that as much as 45 percent of the data in one set, C4, had been restricted by websites’ terms of service.
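For readers who have not seen a robots.txt file in practice, here is a minimal sketch of how a well-behaved crawler could check it before fetching a page. It uses Python's standard `urllib.robotparser` module. The rules, the example.com URLs, and the helper name `is_allowed` are my own illustrative assumptions and do not come from the study; GPTBot and CCBot are used only as examples of crawler names a publisher might list.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary publisher (illustration only).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /articles/

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    article = "https://example.com/articles/some-story"
    for agent in ("GPTBot", "CCBot", "SomeOtherBot"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, article) else "blocked"
        print(f"{agent}: {verdict}")
```

With these assumed rules, the two named crawlers are blocked from the article while any other bot is allowed. Keep in mind that robots.txt is a request rather than an enforcement mechanism: it only works when the crawler chooses to honor it.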

The study shows an alarming decline in consent to use data across the web, which will have ramifications not just for AI companies but also for researchers, academics, and non-commercial entities.

Data is the fuel for AI developers, and tools such as OpenAI's ChatGPT, Google's Gemini, and Anthropic's Claude depend on publicly available data to provide the functionality to write, code, and generate images and videos. The more high-quality data there is, the better the output from the models will be. This echoes a lesson from my 20+ years in business intelligence/analytics and data warehousing: garbage in, garbage out. The same applies to AI; if you feed a model garbage, you should not expect quality out of it.

Some AI companies have negotiated deals with publishers such as The Associated Press and News Corp (owner of the Wall Street Journal) to gain access to their content. Other publishers, such as the New York Times, have sued OpenAI and Microsoft for copyright infringement, alleging that their news articles were used to train models without permission.

Training large language models is extremely expensive and requires a business that can afford to pour millions upon millions into model training. Most organizations should not even consider training foundation models. The trend is to build domain-specific models that need much less data and can be linked to the massive LLMs offered as services by large providers (such as OpenAI, Microsoft AI, Google Gemini, AWS AI, Meta AI, and many others).

It is clear that smaller AI developers relying on public data sets could run into business model issues, as they can't afford to license data from publishers. According to the New York Times article, Common Crawl, one such data set, which comprises billions of pages of web content and is maintained by a nonprofit, has been cited in more than 10,000 academic studies.

Furthermore, it is not clear which popular AI products have been trained on these sources, as most of them do not disclose the full list of data they use. However, data sets derived from Common Crawl, including C4 (which stands for Colossal Clean Crawled Corpus), have been used by companies, including Google and OpenAI, to train previous versions of their models. The study concludes that new tools are needed to give website owners more precise ways to control the use of their data.

As an author of two business books, I understand the concerns that publishers and artists have about the use of their data. We have put hundreds, even thousands, of hours into our work. If this work is then included in LLMs without our consent or without payment, it could be upsetting, especially for the authors and artists whose livelihood depends on their work.

Entrepreneurs and organizations using public data should reflect on potential changes in the rules of engagement for internet data use. If the sole business model is based on GenAI data from the public domain, these organizations could run into issues. Companies building their solutions on GenAI and consuming LLM services from platform vendors like Microsoft should not worry. Microsoft has agreed to protect them from any potential infringement lawsuits. However, in these cases

The GenAI story is very similar to what happened in the music industry a few years ago. Consumers thought that they could download music for free and with unlimited use, but this changed quickly, and new vendors appeared, such as Spotify, where the consumer paid Spotify and Spotify paid the artists. I expect this trend to become the norm within the AI industry as well. Nothing is free. My dad told me this when I was young, and that statement still holds. Somebody will always pay for the "free stuff."

Let me know your thoughts on how the AI industry will evolve and whether you and your work are already impacted by it. I would love to hear your thoughts.

Yours,

Dr. Petri I. Salonen

PS. If you would like to get my Business Models in the AI Era newsletter in your inbox on a weekly or bi-weekly basis, you can subscribe here on LinkedIn: https://www.dhirubhai.net/newsletters/business-models-in-the-ai-era-7165724425013673985/
