The Law of Diminishing Returns of Training LLMs

Mark Zuckerberg pointed out that energy, not compute, will be the primary bottleneck for generative AI. Current data centers consume around 50-150 MW of power, but future AI applications may demand even larger facilities and dedicated power plants, posing a significant challenge. However, there’s another critical bottleneck we cannot overlook: the availability and quality of training data.

GPT models are language models built to complete text by predicting the next word. To expand their capabilities significantly, we have to scale them up, and scaling demands vast amounts of data. GPT-4, for instance, is estimated to have been trained on around 10 trillion words. We are already approaching the limits of publicly available human-written text, and some forecasts suggest that we could exhaust this data by 2026.
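How quickly could the supply run out? Below is a rough back-of-the-envelope sketch; every figure in it (the stock of public human text, its growth rate, and how fast training runs consume data) is an illustrative assumption rather than a measurement.

```python
# Back-of-the-envelope sketch: years until training-data demand overtakes the
# stock of public human-written text. All numbers are illustrative assumptions.

stock_tokens = 3e14      # assumed stock of usable public human text, in tokens
stock_growth = 0.07      # assumed yearly growth of that stock (new human writing)
demand_tokens = 1e13     # assumed tokens consumed by a frontier training run today
demand_growth = 1.0      # assumed yearly growth in demand (here: doubling each year)

years = 0
while demand_tokens < stock_tokens and years < 100:
    stock_tokens *= 1 + stock_growth
    demand_tokens *= 1 + demand_growth
    years += 1

print(f"Under these assumptions, demand overtakes supply in about {years} years.")
```

With demand doubling every year, even a much larger assumed stock of text only buys an extra year or two.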


Mira Lapata | University of Edinburgh

Of course, more data is not always the answer. We need better data. But Houston, we have another problem.

Much of the new content published on the internet is now either AI-generated or AI-assisted. This raises a concerning possibility: future language models may predominantly train on content generated by their predecessors, creating a feedback loop in which each model increasingly learns from its own outputs.


What does this inbreeding mean for future LLMs?

Nothing Good.

  • Homogenization: The models could begin to reinforce existing patterns, resulting in a more homogenized language model that lacks creativity and diversity.
  • Bias Amplification: Any existing biases in earlier versions could be amplified in future iterations.
  • Performance Degradation: Models’ ability to generalize and to handle novel, complex, or abstract concepts could be compromised.

Training on AI-generated content is like taking a photocopy of a photocopy. It gets a little worse every time we do it.

When LLMs are fine-tuned on new tasks or datasets, they can experience catastrophic forgetting—where new learning overwrites the ability to perform previously learned tasks. This occurs because the same parameters (weights and biases) are adjusted for each new task, potentially erasing the knowledge needed for older tasks.
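A toy sketch makes the mechanism concrete. This is my own illustration, not an LLM experiment: one small network is trained on a synthetic task A, then fine-tuned on an unrelated task B with the same weights, and its task-A loss climbs back up.

```python
# Toy illustration of catastrophic forgetting (a tiny regression net, not an LLM):
# the same weights are reused for task B, which erodes what was learned for task A.

import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()

# Two unrelated synthetic regression tasks.
xa, ya = torch.randn(256, 10), torch.randn(256, 1)
xb, yb = torch.randn(256, 10), torch.randn(256, 1)

def train(x, y, steps=500):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

train(xa, ya)                                    # learn task A
loss_a_before = loss_fn(model(xa), ya).item()
train(xb, yb)                                    # fine-tune on task B, same weights
loss_a_after = loss_fn(model(xa), ya).item()

print(f"Task A loss before fine-tuning on B: {loss_a_before:.3f}")
print(f"Task A loss after fine-tuning on B:  {loss_a_after:.3f}")  # typically much higher
```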

Training these models on AI-generated content could further intensify catastrophic forgetting. Because synthetic data often lacks the diversity and accuracy of human-generated content, it may cause the model to overwrite knowledge learned from human-written text with less diverse, probability-based synthetic content.

This phenomenon is known as "Model Autophagy Disorder" (MAD). Recursively training LLMs on their own content produces self-consuming echo chambers that degrade model quality and diversity. Put simply, generative models will go “MAD” unless they are regularly infused with fresh, real-world human data.
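The dynamic is easy to reproduce in miniature. In the sketch below, a one-dimensional Gaussian stands in for the language model: each generation is re-fitted only to samples drawn from the previous generation, with no fresh real data, and its spread tends to collapse. The setup and numbers are my own toy illustration, not taken from the MAD paper.

```python
# Toy "self-consuming loop": refit a 1-D Gaussian to samples from its previous
# generation. With finite samples and no fresh real data, diversity (sigma)
# tends to shrink across generations.

import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 0.0, 1.0     # generation 0: the "real" data distribution
n_samples = 25           # small, finite sample per generation (the key ingredient)

for gen in range(1, 61):
    samples = rng.normal(mu, sigma, n_samples)   # generate synthetic data
    mu, sigma = samples.mean(), samples.std()    # fit the next generation on it
    if gen % 10 == 0:
        print(f"generation {gen:2d}: sigma = {sigma:.3f}")

# Each refit loses a little of the parent's variance on average, and the losses
# compound: over many generations sigma drifts toward zero.
```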

I put this question to the smartest person in the room, ChatGPT itself:

Appreciate the candor.

It's not just generalists like ChatGPT; specialist AI researchers share this view:


Well, can’t we just exclude AI-generated content from model training?

Unfortunately, we cannot, at least not yet. Distinguishing synthetic content from human-generated content, known as the provenance problem, is proving extremely difficult.

In fact, OpenAI discontinued its AI classifier, a tool designed to differentiate between human-generated and AI-generated content, citing its "low rate of accuracy". Some generated text still has an obvious “tell”, but as models get better at mimicking human writing, such tools will only become less reliable.
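For a sense of what such detectors look like under the hood, here is a deliberately crude sketch. It is not OpenAI's classifier, and the texts and labels are made-up placeholders; it keys on surface word statistics, which is precisely the kind of signal that fades as generators improve.

```python
# Crude sketch of an AI-text detector: bag-of-words features + logistic regression.
# The training texts and labels are hypothetical placeholders, far too small to be
# meaningful; the point is only to show the shape of the approach.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

human_texts = [
    "ugh, my flight got delayed again, so annoying",
    "made grandma's stew tonight and slightly burned the onions",
]
ai_texts = [
    "Certainly! Here are five key considerations to keep in mind.",
    "In conclusion, it is important to note the multifaceted nature of this topic.",
]

texts = human_texts + ai_texts
labels = [0, 0, 1, 1]    # 0 = human-written, 1 = AI-generated

detector = make_pipeline(TfidfVectorizer(), LogisticRegression())
detector.fit(texts, labels)

# Predictions rest on whatever word statistics the model happened to see, so a
# generator that avoids those tells will sail straight through.
print(detector.predict(["It is important to note these key considerations."]))
```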


What can we do to solve this problem?

  • Better Content Classification: Ironically, add a human in the loop for data validation, to ensure that training data is sourced from high-quality, human-generated content.
  • Stronger Guardrails: Implement safeguards that prevent current models from generating harmful or nonsensical outputs. Chances are that future generations will inherit the bad behavior of today's models.
  • Downsizing and optimizing the models: GPT-4 has been rumored to have around 100 trillion parameters, though OpenAI has not disclosed its size. For context, the human brain has approximately 86 billion neurons connected by an estimated 100 trillion synapses, and not all of them are used for language processing. Building language models the size of a human brain might not be the most sustainable long-term strategy; reaching the next frontier in deep learning requires more than deep pockets. For example, the BigScience project’s T0 model, which is 16 times smaller than GPT-3, outperformed GPT-3 on many tasks, demonstrating that smaller, more efficient models can achieve strong performance without massive scaling.
  • Transfer Learning: Pre-trained LLMs are fine-tuned on specific tasks with smaller datasets. This approach achieves high performance while reducing the dependency on vast amounts of new, publicly available data, adding specialized depth to the broad capabilities of LLMs. The knowledge transfer is, however, unidirectional: fine-tuning affects only the fine-tuned copy of the model and does not inform or improve the original pre-trained model. (A minimal fine-tuning sketch follows this list.)
  • More Original Human Content: Don't ask ChatGPT to write everything. We may have to rely a little less on AI today in order to have better AI in the future.
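Here is the minimal fine-tuning sketch referenced in the transfer-learning item above. It assumes the Hugging Face transformers library and PyTorch; the checkpoint name, the two-line toy dataset, and the handful of training steps are placeholders, not a recommended setup.

```python
# Minimal transfer-learning sketch: reuse a pretrained checkpoint, freeze its
# encoder, and train only the small task-specific head on a tiny labeled dataset.
# The checkpoint name and example data are placeholders.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased"    # placeholder pretrained checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Freeze the pretrained encoder; only the new classification head stays trainable.
for param in model.base_model.parameters():
    param.requires_grad = False

texts = ["the service was wonderful", "the food was cold and late"]   # toy task data
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=5e-5)

model.train()
for _ in range(3):                               # a few illustrative steps
    outputs = model(**batch, labels=labels)      # the loss is computed internally
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# The original pretrained checkpoint on disk is untouched; only this fine-tuned
# copy learns the new task, which is the unidirectional transfer noted above.
```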
