AI Efficiency Limits: Datasets, Computation Budget and Chinchilla Scaling Laws

We have already reviewed the size (i.e., parameter quantity) limitations of current AI models. Now, I would like to examine other AI limitations: training datasets and computation budgets.

"Computation Budget" by Midjourney

Computation Budget — this term refers to the resources needed to train a model, including time, memory space, electricity, and, of course, the quantity and power of CPUs, GPUs, and TPUs involved in the process. It's clear that computation budget is a critical factor: the more GPUs, and especially TPUs, we have, the faster and better the results. But how many resources are needed to train AI, especially if we aim to make progress in development?

Let's dive a little deeper.

Machine learning is conceptually similar to human learning: the model receives a sample of a task, tries to solve it, and learns from its mistakes.

Model's Groundhog Day
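
To make that loop concrete, here is a minimal toy sketch in Python of "receive a sample, try to solve it, learn from the mistake". The model, the data, and the learning rate are all invented for illustration; no real system trains on anything this small.

```python
# Toy "task => solution" learning loop: fit y = 2x with one trainable parameter.
# All numbers here are illustrative assumptions, not a real training setup.

samples = [(x, 2.0 * x) for x in range(1, 6)]  # "task => solution" pairs
w = 0.0    # the model: a single trainable weight
lr = 0.01  # learning rate

for epoch in range(200):
    for x, y_true in samples:
        y_pred = w * x            # the model tries to solve the task
        error = y_pred - y_true   # how wrong it was
        w -= lr * error * x       # learn from the mistake (a gradient step)

print(f"learned w = {w:.3f} (the target is 2.0)")
```

Real models repeat exactly this cycle, only with billions of weights and enormous numbers of samples: the model's Groundhog Day.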

The AI we all hope to develop must be able to solve an enormous range of tasks, which means it needs to "study" a lot — an unimaginable amount.

Unfortunately, current learning algorithms are far from efficient. Learning is still a challenging process. To solve a specific task, a model must receive a huge amount of “task => solution” samples. Even for just a few tasks, the amount of information the model needs is massive.

"Computation Budget" by Stable Diffusion

Initially, datasets for model training were created manually. Each dataset required hundreds of hours of human labor and contained millions of samples, but was designed to train the model for a single, specific task. Around 2018-2019, the industry shifted away from this approach, deeming it too time-consuming, expensive, and inflexible.

Instead, scientists began using any available raw data for model training, such as web pages or images downloaded from the Internet, without filtering or preparation.

Sample of a dataset

While the quality of such data cannot compare to manually prepared datasets...

A sample of such data: "Blue cat". Accuracy problems can be found in any row :-)

...its quantity improved machine learning, as volume became crucial to making progress with inefficient learning algorithms. The more diverse, unfiltered data was used for training, the more intelligent the model became.

Fundamental research on how model intelligence depends on the size of the training dataset did not keep the world waiting:

In 2020, OpenAI published the article “Scaling Laws for Autoregressive Generative Modeling”, which showed that increasing the number of parameters 10 times required increasing the training dataset only about 2.5 times (10^0.4, to be precise)... and the computation budget 25 times.
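
A quick back-of-the-envelope check of where the "25 times" comes from, under my assumption that training compute is roughly proportional to the number of parameters multiplied by the amount of training data:

```python
# Assumption: compute C ~ N (parameters) * D (training data).
# With the 2020 exponent, D grows as N**0.4, so C grows as N**1.4.
param_growth = 10
data_growth = param_growth ** 0.4              # ~2.5x
compute_growth = param_growth * data_growth    # ~25x
print(f"data: x{data_growth:.1f}, compute: x{compute_growth:.0f}")
```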

In 2022, DeepMind published the article “Training Compute-Optimal Large Language Models”, which significantly revised these dependencies: the training data and the model size should be scaled equally. This means that all previous models had been trained on a severely insufficient volume of data.

This relationship became known as the Chinchilla Scaling Laws:

“We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks.”

Together, these two articles lead to a general conclusion: increasing the number of parameters 10 times requires increasing the training dataset 10 times as well… and the computation budget 100 times.
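
The same rough proportionality (compute ~ parameters × data) reproduces the "100 times" figure and also gives a quick sanity check of the quote above: 4× fewer parameters with 4× more data leave the product, and hence the budget, unchanged. The relative units below are mine, for illustration only.

```python
# Chinchilla vs Gopher sanity check, assuming compute ~ parameters * data:
# 4x fewer parameters and 4x more data keep the product (the budget) the same.
gopher_params, gopher_data = 280, 1.0        # relative units, not real token counts
chinchilla_params, chinchilla_data = 70, 4.0
print(gopher_params * gopher_data == chinchilla_params * chinchilla_data)  # True

# And with equal scaling of parameters and data (the Chinchilla rule):
param_growth = 10
data_growth = param_growth                   # scaled equally with parameters
print(f"compute: x{param_growth * data_growth}")  # x100
```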

The Chinchilla Scaling Laws gave us at least two things:

  1. Another interesting explanation of why the trend of increasing parameter counts has slowed down.
  2. An integral indicator that helps us estimate the size and intelligence of modern models: the computation budget used for their training (a rough estimation sketch follows below).
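
As a sketch of how that indicator can be used: a widely cited approximation for dense transformer training is C ≈ 6 · N · D FLOPs, where N is the parameter count and D is the number of training tokens. The formula is an approximation and the example numbers are my assumption of the rough Chinchilla setup, so treat the result as an order-of-magnitude estimate.

```python
# Rough training-compute estimate via the common approximation C ~= 6 * N * D FLOPs.
# N = number of parameters, D = number of training tokens.

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute in FLOPs."""
    return 6.0 * n_params * n_tokens

# Example: a 70B-parameter model trained on ~1.4T tokens (roughly the Chinchilla setup).
print(f"{training_flops(70e9, 1.4e12):.2e} FLOPs")  # ~5.9e+23
```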

There is another article that can help us see what the trends are: Compute Trends Across Three Eras of Machine Learning.

The most interesting part of this article reveals two trends:

Graphic reason for building new nuclear power plants

  1. Deep Learning Era models (mostly academic), whose computation budgets double every 5-6 months.
  2. Large Scale Era models, trained by big corporations, which started with much larger computation budgets and double them every 9-10 months.

But since big corporations don't share their data, there is room for alternative calculations. For example:

Training GPT-3 required 3.14e23 FLOPs.

Training GPT-4 required 2.15e25 FLOPs.

The compute budget increased 68 times in 33 months, i.e., it doubled every 5-6 months :-)
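
The arithmetic behind that estimate, using the two FLOPs figures above and the roughly 33 months between the two training runs (the time gap is taken from the text, not from any official source):

```python
import math

gpt3_flops = 3.14e23
gpt4_flops = 2.15e25
months_between = 33

growth = gpt4_flops / gpt3_flops   # ~68x
doublings = math.log2(growth)      # ~6.1 doublings
print(f"growth: x{growth:.0f}, doubling time: {months_between / doublings:.1f} months")
```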

Will computation budgets continue to grow aggressively?

This is for 2023; in 2024 it is going to be bigger

Judging by the construction of new data centers by OpenAI and its competitors, I would say yes.

Are there limits and challenges to this growth? Certainly.

To be continued…
