The law of diminishing returns in training LLMs
Abhishek Gorla
Business Intelligence Engineer 2 @ Amazon | Business Analytics, Project Management
Mark Zuckerberg pointed out that energy, not compute, will be the primary bottleneck for generative AI. Current data centers consume around 50-150 MW of power, but future AI applications may demand even larger facilities and dedicated power plants, posing a significant challenge. However, there’s another critical bottleneck we cannot overlook: the availability and quality of training data.
GPT models are, at their core, language models built to predict what comes next in a piece of text. Expanding their capabilities significantly means scaling them up, and scaling demands vast amounts of data. GPT-4, for instance, is estimated to have processed around 10 trillion words during training. We are already approaching the limits of publicly available human-written text, and some forecasts suggest that we could exhaust it by 2026.
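To make that kind of forecast concrete, here is a back-of-envelope sketch in Python. Every number in it is a placeholder I've assumed for illustration (the stock of public text, the tokens in a frontier training run, the yearly growth in data demand), not a figure taken from the forecasts above; what matters is the shape of the calculation, not the exact year it prints.

```python
# Back-of-envelope: when does training demand overtake the stock of
# public human-written text? All three constants below are illustrative
# assumptions -- swap in your own estimates.
STOCK_OF_PUBLIC_TEXT_TOKENS = 3e14     # assumed total public human-written text
TOKENS_IN_LATEST_FRONTIER_RUN = 1.3e13 # assumed tokens consumed by a recent frontier run
ANNUAL_GROWTH_IN_DATA_DEMAND = 2.5     # assumed yearly multiplier on dataset size

year, demand = 2024, TOKENS_IN_LATEST_FRONTIER_RUN
while demand < STOCK_OF_PUBLIC_TEXT_TOKENS:
    year += 1
    demand *= ANNUAL_GROWTH_IN_DATA_DEMAND

print(f"Under these assumptions, demand overtakes the stock around {year}.")
```

Nudge the growth rate or the stock estimate and the crossover moves by a few years in either direction, which is exactly why such forecasts carry wide error bars.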
Of course, more data is not always the answer. We need better data. But Houston, we have another problem.
An ever-growing share of the new content published on the internet is either AI-generated or AI-assisted. This raises a concerning possibility: future language models may predominantly train on content generated by their predecessors, creating a feedback loop in which models increasingly learn from their own outputs.
What does this inbreeding mean for future LLMs?
Nothing good.
Training on AI-generated content is like taking a photocopy of a photocopy. It gets a little worse every time we do it.
When LLMs are fine-tuned on new tasks or datasets, they can experience catastrophic forgetting—where new learning overwrites the ability to perform previously learned tasks. This occurs because the same parameters (weights and biases) are adjusted for each new task, potentially erasing the knowledge needed for older tasks.
Training these models on AI-generated content could intensify catastrophic forgetting even further. Because synthetic data often lacks the diversity and accuracy of human-generated content, it may cause the model to overwrite what it learned from human-written text with less diverse, probability-based synthetic content.
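Here is the mechanism shrunk to toy scale: a small linear classifier (nothing like an LLM) trained with scikit-learn on one synthetic task and then fine-tuned on a second one, reusing the same weights. Both "tasks" and all parameters are invented purely for illustration.

```python
# Toy demonstration of catastrophic forgetting with a linear classifier.
# Requires: pip install scikit-learn numpy
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Task A and Task B: two different synthetic classification problems
# that share the same 20-dimensional feature space.
X_a, y_a = make_classification(n_samples=2000, n_features=20, random_state=1)
X_b, y_b = make_classification(n_samples=2000, n_features=20, random_state=2)

clf = SGDClassifier(random_state=0)  # one set of weights for everything

# Phase 1: learn Task A.
for _ in range(20):
    clf.partial_fit(X_a, y_a, classes=np.array([0, 1]))
acc_a_before = clf.score(X_a, y_a)

# Phase 2: keep training on Task B only. The same weights are updated,
# so the decision boundary learned for Task A gets overwritten.
for _ in range(20):
    clf.partial_fit(X_b, y_b)

print(f"Task A accuracy before fine-tuning on B: {acc_a_before:.2f}")
print(f"Task A accuracy after fine-tuning on B:  {clf.score(X_a, y_a):.2f}")
print(f"Task B accuracy:                         {clf.score(X_b, y_b):.2f}")
```

The standard mitigation is rehearsal: keep mixing some of the original (human) data into every later training phase. That is exactly what gets harder once fresh human data is scarce.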
This recursive, self-consuming training loop has been dubbed "Model Autophagy Disorder" (MAD). Recursively training LLMs on their own content produces self-consuming echo chambers that degrade model quality and diversity. Put simply, generative models will go “MAD” unless they are regularly infused with fresh, real-world human data.
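The dynamic is easy to reproduce at toy scale. In the sketch below the "model" is just a Gaussian fitted to data, and each generation trains only on samples produced by the generation before it; the setup and numbers are mine, loosely inspired by the MAD line of work rather than taken from it.

```python
# Toy self-consuming training loop: fit a distribution, sample from it,
# fit the next "generation" to those samples, repeat.
import numpy as np

rng = np.random.default_rng(0)

n_samples = 100
data = rng.normal(loc=0.0, scale=1.0, size=n_samples)  # generation 0: "human" data

for generation in range(1, 21):
    mu, sigma = data.mean(), data.std()      # "train" on the current dataset
    print(f"gen {generation:2d}: mean={mu:+.3f}  std={sigma:.3f}")
    data = rng.normal(mu, sigma, n_samples)  # next generation sees only synthetic data

# Run it a few times: the fitted spread drifts with every generation and,
# left long enough, tends to collapse. The tails of the original
# distribution -- the rare, interesting examples -- are the first to go.
```

Swap the Gaussian for an LLM and the shrinking spread for shrinking linguistic diversity, and you have the intuition behind the warning above.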
I posed this same question to the smartest person in the room:
Appreciate the candor.
It's not just generalists like ChatGPT; even specialists like AI researchers share this belief:
Well, can’t we just exclude AI-generated content from model training?
Unfortunately, we cannot, at least not yet. The challenge of distinguishing synthetic content from human-generated content, known as the provenance problem, is proving to be extremely difficult.
In fact, OpenAI discontinued its AI classifier, a tool designed to differentiate between human-generated and AI-generated content, citing its "low rate of accuracy". Although some generated text still has an obvious “tell”, such tools will only become less reliable as the models get better at mimicking human writing.
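For intuition, here is a toy version of the kind of statistical signal detectors lean on: text that looks unusually "predictable" to one language model is weak evidence that another language model wrote it. The sketch uses GPT-2 via the Hugging Face transformers library; it is a generic heuristic of my own choosing, not a reconstruction of OpenAI's retired classifier.

```python
# A perplexity-based "detector" heuristic -- a toy, not a real classifier.
# Requires: pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2: lower means more 'predictable'."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(enc.input_ids, labels=enc.input_ids)
    return torch.exp(out.loss).item()

samples = [
    "My grandmother's recipe calls for a splash of pickle brine, oddly enough.",
    "In conclusion, artificial intelligence is a transformative technology with many applications.",
]
for text in samples:
    print(f"perplexity = {perplexity(text):7.1f} | {text}")

# The catch: as generators get better at mimicking human writing, their
# output stops looking statistically unusual, and signals like this one
# (and the classifiers built on top of them) stop working.
```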
What can we do to solve this problem?