Coming soon: stagnation and collapse of AI?
Andreas Amrein · PhD MBA MSc
Global SVP and MD ★ Pharma / Biotech / MedTech ★ Expansions ★ Transformations ★ Acquisitions ★ Partnerships
Have we all been exposed to or directly used Artificial Intelligence in our jobs? After years of experiencing its growing influence, we are now beginning to notice a significant limitation—one that might call everything into question.
The core issue: AI's dependency on real data
AI relies on vast amounts of real-world text and images to improve, but human-generated data is finite. AI companies are now scrambling to find a solution.
What users are observing
Consider a striking example: When AI generates images of hands, the result is sometimes bizarre. The number of fingers might be wrong, or the wrists appear unnaturally bent. The reason? AI only learns from examples; it doesn’t "know" how many fingers a person has. It simply processes the images it's been trained on, and hands—especially in diverse, nuanced positions—are underrepresented in online datasets. Faces, on the other hand, are more frequently depicted, so AI is able to make them appear more accurate.
...but it’s not just hands...
You could replace "hands" with any other infrequently represented concept on the internet. The AI’s limitations extend to anything with sparse or inconsistent data, often resulting in factual inaccuracies or complete inventions.
The proposed solution
Some researchers believe that these so-called hallucinations can be reduced by increasing the amount of training data. After all, AI's recent advancements have largely been the result of massive training sets, not groundbreaking new techniques. The exponential growth in data used to train language models is undeniable.
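The argument can be made concrete with a toy calculation. Here is a minimal sketch, assuming a Chinchilla-style power law in which loss falls as the training set grows; the constants are illustrative assumptions, not fitted values from any published model:

```python
# Toy illustration of the "more data" argument: under an assumed
# power law, loss = irreducible + B / D**beta, each extra order of
# magnitude of data buys a smaller improvement. The constants below
# are illustrative assumptions, not fitted values.

def loss(d_tokens: float, irreducible: float = 1.7,
         b: float = 2.0e3, beta: float = 0.28) -> float:
    """Assumed power-law loss as a function of training tokens."""
    return irreducible + b / d_tokens ** beta

for d in [1e9, 1e10, 1e11, 1e12, 1e13]:
    print(f"{d:.0e} tokens -> loss {loss(d):.2f}")
```

Each tenfold increase in data buys a smaller improvement, which is exactly why the appetite for ever more data keeps growing.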
However...
AI has already combed through most of the internet. The majority of publicly available data, from Wikipedia to online forums to digitized books, has already been processed, revealing a hard ceiling. Moreover, as AI continues to generate content, AI-produced data is being mixed back into the web, contaminating future training sets in a recursive loop.
The "Photocopy Effect": AI learning from AI
Nicolas Papernot from the University of Toronto, along with other researchers, has studied the effects of AI learning from AI-generated content. He compares it to making photocopies of photocopies—each copy is less accurate than the last. For instance, if an AI is trained to generate cat images based on 100 photos, 90 of which depict yellow cats and only 10 show blue cats, AI will likely make the blue cats appear more yellow. If another AI model is trained on those generated images, the "blue" cats might eventually disappear entirely.
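The photocopy analogy is easy to simulate. Below is a minimal sketch, assuming a toy "model" that simply resamples from its training set: starting from 90 yellow and 10 blue cats, each generation trains on the previous generation's output. The numbers and the resampling model are illustrative assumptions, not Papernot's actual experiment.

```python
import random

random.seed(0)

# Toy model collapse: each "generation" is trained on samples drawn
# from the previous generation's output. Resampling with replacement
# tends to lose rare modes over time (a photocopy of a photocopy).
data = ["yellow"] * 90 + ["blue"] * 10  # generation 0: real data

for gen in range(1, 21):
    # The "model" here is just the empirical distribution of its
    # training set; generating = sampling with replacement.
    data = [random.choice(data) for _ in range(100)]
    blue = data.count("blue")
    print(f"generation {gen:2d}: {blue}% blue cats")
    if blue == 0:
        print("blue cats have vanished -> model collapse")
        break
```

Run repeatedly, the minority color random-walks toward extinction, and once it is gone it never comes back. That is the essence of the photocopy effect.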
This point, termed "model collapse," marks the moment when AI's creations no longer resemble reality. The consequences are far-reaching: what happens when AI loses the "blue cats"? Applied to data about people, we risk erasing important details, introducing bias and marginalizing minority groups.
Is there hope? Let's look at games...
Interestingly, AI has shown remarkable progress in some areas. For instance, DeepMind's AlphaGo learned the game of Go by playing millions of games against itself, achieving groundbreaking success. In 2016, AlphaGo played a move that no human player had ever seen before, stunning Go experts. This success demonstrates the potential of synthetic data, at least in domains with clear rules.
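Self-play works because the rules of the game supply both unlimited fresh positions and an unambiguous success signal. Here is a minimal sketch using tic-tac-toe instead of Go for brevity: random self-play generates labeled (position, outcome) training pairs without any human data. This is a deliberate simplification; AlphaGo's actual pipeline combined self-play with deep networks and tree search.

```python
import random

random.seed(1)

WINS = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
        (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if either has three in a row, else None."""
    for a, b, c in WINS:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def self_play_game():
    """Play one game of random self-play; return (positions, result)."""
    board, positions, player = [" "] * 9, [], "X"
    while True:
        positions.append("".join(board))
        moves = [i for i, cell in enumerate(board) if cell == " "]
        if not moves:
            return positions, "draw"
        board[random.choice(moves)] = player
        if winner(board):
            return positions, player
        player = "O" if player == "X" else "X"

# The rules act as a free oracle: every self-play game yields labeled
# (position, outcome) pairs, so training data can grow without limit
# and without any human-generated examples.
dataset = []
for _ in range(1000):
    positions, result = self_play_game()
    dataset.extend((pos, result) for pos in positions)

print(f"{len(dataset)} labeled positions from 1000 self-play games")
```

The key design point: the rules of Go (or tic-tac-toe) act as a free, infallible labeler, which is exactly what open-ended language and image data lack.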
But there are limitations of synthetic data
While games like Go have defined rules, language and images are vastly more complex. They lack clear metrics for success, and without rules, generating useful synthetic data becomes nearly impossible. Today's language systems therefore rely entirely on examples rather than rules, which compounds the data problem.
Hence, AI companies’ desperate search for data
The scarcity of training data is forcing AI companies to seek alternative, often questionable, sources. For example, Meta has clashed with EU authorities over its intention to use users' posts and images to train AI. In other regions without strict data protection laws, it’s already doing so. According to a New York Times investigation, OpenAI has likely transcribed vast amounts of YouTube videos—potentially illegally—to train GPT-4. Google has also adjusted its terms of use, possibly allowing it to harvest data from restaurant reviews and public Google Docs.
Companies are pulling data from every available source because time is running out. Epoch AI estimates that by 2028, human-generated public content will no longer be enough to train better AI models.
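That projection reduces to simple arithmetic. Here is a hedged back-of-the-envelope sketch, assuming a fixed stock of usable public text and training sets that grow by a constant factor each year; both figures are rough illustrative assumptions in the spirit of Epoch AI's estimates, not their published model.

```python
# Back-of-the-envelope version of the "running out of data" estimate.
# Both constants are rough illustrative assumptions, not Epoch AI's
# actual figures or methodology.
STOCK = 300e12   # assumed stock of usable public human text, in tokens
size = 15e12     # assumed size of the largest 2024 training set
growth = 2.5     # assumed yearly growth factor of training sets

year = 2024
while size < STOCK:
    year += 1
    size *= growth

print(f"Under these assumptions, training sets outgrow the stock around {year}.")
```

Change the assumed constants and the year shifts, but with exponential growth running against a fixed stock, the wall arrives within a few years either way.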
The ripple effect on blogs and media
Generative AI is already changing the content landscape on the internet. As more users turn to AI chatbots instead of traditional browsing, websites that rely on ad revenue from clicks are feeling the pressure. This impacts not only online magazines but also forums like Stack Overflow, where users once exchanged programming advice. With AI assistants now providing answers, these forums are losing their traffic—and another valuable source of human-generated training material disappears. Similar examples abound.
What comes next?
For AI to continue evolving, new approaches are needed. Innovations in how AI learns, and in how it extracts more value from existing data, will be crucial. As we navigate this complex landscape, it's important to remember that at the heart of it all is us, the humans. Let's keep in mind a quote attributed to Albert Einstein:
“It has become appallingly obvious that our technology has exceeded our humanity.”
The future of AI remains uncertain, but one thing is clear: it's going to stay interesting, and we have one more reason to keep creating human content, real data, instead of just consuming it.