AI trained on AI garbage spits out… AI garbage
Stephanie Arnett / MIT Technology Review

AI models work by training on huge swaths of data from the internet. But as AI is increasingly being used to pump out web pages filled with junk content, that process is in danger of being undermined. In this edition of What’s Next in Tech, learn why the quality of an AI model’s output gradually degrades when it trains on AI-generated data.

The Algorithm, our weekly AI newsletter, cuts through the hype, reveals the latest research from the hottest labs, and explains what big tech firms are doing behind closed doors. Sign up for free today.

As junk web pages written by AI proliferate, the models that rely on that data will suffer.

New research published in Nature shows that the quality of an AI model’s output gradually degrades when it’s trained on AI-generated data. As subsequent models produce output that is then used as training data for future models, the effect gets worse.

Ilia Shumailov, a computer scientist from the University of Oxford who led the study, likens the process to taking photos of photos. “If you take a picture and you scan it, and then you print it, and you repeat this process over time, basically the noise overwhelms the whole process,” he says. “You’re left with a dark square.” The equivalent of the dark square for AI is called “model collapse,” he says, meaning the model just produces incoherent garbage.
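The photocopying analogy can be sketched numerically. If you repeatedly fit a simple statistical model to its own samples, estimation noise compounds across generations, just as scanner noise compounds across copies. The following is a minimal illustration using a Gaussian refit loop, not the paper's actual LLM setup:

```python
import random
import statistics

def fit_and_resample(data, n):
    """Fit a normal distribution to the data, then sample a new
    'generation' of data from that fitted model."""
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    return [random.gauss(mu, sigma) for _ in range(n)]

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(200)]  # the original "human" data
for gen in range(1, 10):  # nine generations, mirroring the study's setup
    data = fit_and_resample(data, 200)
    print(f"generation {gen}: mean={statistics.mean(data):+.3f}, "
          f"std={statistics.stdev(data):.3f}")
```

Because each generation sees only the previous generation's samples, the fitted parameters drift and the distribution's rare tails erode over time, the statistical analogue of noise overwhelming the photo.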

This research may have serious implications for the largest AI models of today, because they use the internet as their database. GPT-3, for example, was trained in part on data from Common Crawl, an online repository of over 3 billion web pages. And the problem is likely to get worse as an increasing number of AI-generated junk websites start cluttering up the internet.

Current AI models aren’t going to suddenly collapse, says Shumailov, but there may still be substantive effects: improvements will slow down, and performance might suffer.

To determine the potential effect on performance, Shumailov and his colleagues fine-tuned a large language model on a set of data from Wikipedia, then fine-tuned the new model on its own output over nine generations. The team measured how nonsensical the output was using a “perplexity score,” which measures an AI model’s confidence in its ability to predict the next part of a sequence; a higher score translates to a less accurate model.
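Perplexity can be computed directly from the probabilities a model assigns to each token it predicts: it is the exponential of the average negative log-probability. A quick sketch, using made-up toy probabilities rather than anything from the paper's models:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability the model
    assigned to each token it predicted. Lower means more confident."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Hypothetical per-token probabilities for a confident vs. a degraded model.
confident = [0.9, 0.8, 0.85, 0.9]
degraded = [0.2, 0.1, 0.15, 0.05]

print(perplexity(confident))  # close to 1: strong next-token predictions
print(perplexity(degraded))   # much higher: the model is "perplexed"
```

A model that assigned probability 1 to every correct token would score a perfect perplexity of 1; as its predictions degrade toward guessing, the score climbs, which is why rising perplexity across generations signals collapse.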

Read the story to see some of the junk that AI models returned and learn how researchers are working to avoid this degradation.

Get ahead with these related stories:

  1. Google DeepMind’s new AI systems can now solve complex math problems: AlphaProof and AlphaGeometry 2 are steps toward building systems that can reason, which could unlock exciting new capabilities.
  2. AI companies promised to self-regulate one year ago. What’s changed? The White House’s voluntary AI commitments have brought better red-teaming practices and watermarks, but no meaningful transparency or accountability.
  3. AI can make you more creative, but it has limits: Although it can boost individuals’ creativity, it seems to homogenize and flatten our collective output.



Asim Kr. Chowdhury, PhD

Mktg Head - Evoke Technologies | 25+ Yrs experience | Author | Doctorate @ Delhi School of Economics | ex-EU research scholar

1 month

Insightful! From a systems perspective, this could be just the tip of an iceberg. The complexities we are likely to witness in the future are simply unfathomable.

Tejasvi Devaru

Vice President | Driving value through transformative application of digital technology

1 month

Interesting article! Appreciate the insights. Indeed, "junk in, junk out" is a crucial point when it comes to AI. Ensuring high-quality data input is essential for achieving valuable and accurate outcomes.

The proper training of AI requires an appropriate database structure. Without traceability of training data, it's impossible to trust AI results. The database schema to do this was invented back around 2000 but is broadly NOT deployed in the English-speaking world, just in Germany, France, ... In Germany, the program to deploy AI on this database schema began in 2019, so the UK is already five years behind the curve.


Please note that in this study they did nine iterations of training (actually fine-tuning) AI on AI-generated content. This level of iteration is unlikely to happen in reality, but it certainly needs to be watched. Some have suggested that AI could learn to recognize AI-generated content, but AI could also get smarter at making itself less detectable. Since there is a lot of human-generated content (3 billion pages is what ChatGPT reportedly trained on), this problem may take more time to manifest. There are other, more pressing issues with AI: power and resource usage, hallucinations, copyright issues, and what some refer to as "Peak AI," where an incremental improvement in performance requires exponentially more training data, and simply adding more parameters may make the model overfit the training data (which is limited). A good reference on this is https://arxiv.org/abs/2211.04325

