Scaling laws of LLM based Generative AI Models

Introduction

Almost all LLM based Generative AI models use the same deep learning architecture - the decoder of the transformer architecture proposed in the "Attention Is All You Need" paper. Whether it is GPT-4 or TinyLlama, the underlying deep learning architecture is the same. To improve the performance of these models, several theoretical options are available. Researchers found, however, that performance depends strongly on scale and only weakly on model shape.

Researchers found that using bigger models, with bigger datasets, and training for longer led to significant performance benefits. More specifically, model size in terms of total number of parameters (N), dataset size in terms of number of tokens (D), and the compute used to train the model (C) are the three factors that have a major impact on model performance. By scaling these three factors, model performance can be significantly improved. The compute budget is measured in FLOPs (floating-point operations; FLOPS with a capital S usually denotes operations per second, a measure of hardware throughput). The FLOPs required to train a model are roughly the product of the number of parameters (N) and the number of training tokens (D), multiplied by 6, i.e. C ≈ 6·N·D. The dataset size is measured in tokens; the number of training tokens is the number of tokens processed per training step multiplied by the number of steps.
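
As a rough illustration of the C ≈ 6·N·D approximation, here is a minimal sketch for a GPT-3-scale run (175B parameters, 300B tokens); the 100 TFLOP/s throughput figure used for the GPU-days conversion is an illustrative assumption, not a measured number.

```python
# Rough training-compute estimate: C ~ 6 * N * D
# (about 2 FLOPs/parameter forward + 4 FLOPs/parameter backward, per token).
N = 175e9   # parameters (GPT-3 scale)
D = 300e9   # training tokens

C = 6 * N * D
print(f"Estimated training compute: {C:.2e} FLOPs")   # ~3.15e+23 FLOPs

# Convert to GPU-days, assuming a hypothetical accelerator sustaining
# 100 TFLOP/s of useful throughput (an illustrative assumption).
sustained = 100e12                    # FLOP/s
gpu_days = C / sustained / 86400
print(f"~{gpu_days:,.0f} GPU-days at 100 TFLOP/s sustained")
```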

However, there are constraints on all these factors. Bigger models increase the inference budget: huge models require more memory for both training and inference infrastructure. Increasing dataset size has its own limit - it is said that the entire text of the internet would amount to only a few trillion tokens. The compute budget is also limited: GPU infrastructure is costly and scarce, and once training starts, resources are wasted if the model does not converge to optimal performance. Given these natural constraints, it becomes important to understand the relationship between these factors for optimal model performance.

There are several questions related to model optimization that are important:

1. What should be the relationship between model loss (L) and model size (N)? How does model performance behave when we increase the model size?

2. What is the benefit of increasing the size of the dataset (D)? Is it more efficient to train for longer on a smaller but high-quality dataset?

3. Most large scale LLM based generative AI models are trained for less than one epoch. How does training for a longer duration impact model performance? Does the model loss decrease steadily toward zero as we train longer, giving better and better performance?

4. What is the relationship between model size and dataset size given a fixed compute budget? In other words, what is the optimal number of tokens per model parameter?

5. When we increase the compute budget, how should we increase the model size and dataset size for optimal model performance? Given that the compute budget in FLOPs is proportional to the product of model size and dataset size, should we increase the model size, the dataset size, or both? If both, what should the relationship be?

This article reviews current research insights that bear on the above questions.

Kaplan Scaling Laws

The first major research on this subject was carried out by OpenAI researchers, and their results were published in their paper (Scaling Laws for Neural Language Models, Kaplan et al., 2020). A major insight was that the improvement in performance is not a simple linear relationship but follows a power law: loss falls smoothly and predictably as each scale factor increases. The conclusions drawn in this paper are widely referred to as the Kaplan scaling laws.

Some important conclusions are:

1. Scaling laws: The performance of LLM models has a power-law relationship with each of the three scale factors, which implies diminishing returns long before zero loss is reached.

2. Universality of overfitting: Performance improves predictably as long as N and D are scaled up in tandem, but enters a regime of diminishing returns if either N or D is held fixed while the other increases.

3. Sample efficiency: Large models are more sample-efficient than small models, reaching the same level of performance with fewer optimization steps and using fewer data points.

4. Inefficient convergence: When working within a fixed compute budget C, but without any other restrictions on model size N or available data D, we attain optimal performance by training very large models and stopping significantly short of convergence.

OpenAI researchers concluded that it is always better to favour model size over dataset size - i.e. it is better to have a bigger model and train it for a shorter duration (even less than an epoch) than to have a smaller model with a bigger dataset. In other words, it is compute-optimal to stop training before convergence and spend the budget on a bigger model.
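
These power laws can be written down concretely. Below is a minimal sketch using the fitted constants reported in the Kaplan et al. paper (alpha_N ≈ 0.076, N_c ≈ 8.8e13; alpha_D ≈ 0.095, D_c ≈ 5.4e13); treat the constants as illustrative, since they depend on the tokenizer and data distribution used in that work.

```python
# Kaplan-style power laws: loss falls as a power of model size or dataset
# size when the other factor is not a bottleneck.
def loss_from_params(n, alpha_n=0.076, n_c=8.8e13):
    """L(N) = (N_c / N) ** alpha_N, constants as fitted in Kaplan et al. 2020."""
    return (n_c / n) ** alpha_n

def loss_from_tokens(d, alpha_d=0.095, d_c=5.4e13):
    """L(D) = (D_c / D) ** alpha_D."""
    return (d_c / d) ** alpha_d

# Each doubling of model size removes a roughly constant *fraction* of the
# loss -- a power law, not a linear trend.
for n in [1e9, 2e9, 4e9, 8e9]:
    print(f"N = {n:.0e} params -> L(N) = {loss_from_params(n):.3f} nats/token")
```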

Based on the conclusions of this paper, it became industry practice to favour larger models over bigger datasets. The maximum reasonable dataset size was taken to be about 300 billion tokens (the MassiveText dataset, for comparison, contains 2.35 billion documents), and for that dataset size the above laws indicated a model of about 175 billion parameters.

Answers to the other questions raised in the previous section are as follows:

1. Relationship between model size and dataset size given a fixed compute budget - It was found that for optimal performance each parameter should see about 1.7 dataset tokens (not training tokens). A 175 billion parameter model should therefore be trained on a dataset of about 300 billion tokens.

2. Relationship between compute budget and the other factors - Suppose we double the compute budget: how much should we increase the model size and the dataset size? The above insights imply that both should be increased in tandem, but given diminishing returns at higher levels, what should the relationship between N and D be? The OpenAI researchers concluded that if the model size is increased 8x, the dataset size needs to be increased only about 5x. (See the sketch below.)
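
Both rules of thumb can be checked with a few lines of arithmetic; the sketch below uses only the numbers quoted above.

```python
import math

# Rule 1: ~1.7 dataset tokens per parameter.
print(f"{300e9 / 175e9:.2f} tokens per parameter")   # ~1.71

# Rule 2: "8x model -> 5x data" implies D grows as N**beta with:
beta = math.log(5) / math.log(8)
print(f"beta = {beta:.2f}")   # ~0.77, close to the D ~ N**0.74 fit in the paper

# Under this rule, a 4x model scale-up calls for about 2.9x more data:
print(f"data multiplier for a 4x model: {4 ** beta:.1f}x")
```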

However, both of the above rules, and many of the conclusions listed above, were shown to be incorrect by subsequent research!

Laws of Chinchilla

Subsequent research by DeepMind (Training Compute-Optimal Large Language Models, Hoffmann et al., 2022) found insights that contrast sharply with OpenAI's Kaplan scaling laws. The researchers found that the optimal number of training tokens per parameter should be around 20, as against the 1.7 rule of thumb followed earlier! (Note, however, that D in the Kaplan scaling laws refers to dataset size, while in the Chinchilla paper D refers to the total number of training tokens, i.e. the tokens processed per step multiplied by the number of training steps (S).) This implied that all then-existing models with 175B or more parameters were under-trained, their training having been restricted by the dataset-size relationship. In other words, Chinchilla's law indicated that the existing big GPT models were of inflated size.
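
A quick sketch of what the 20 tokens-per-parameter rule implies at various scales (the paper itself reports a range rather than a single constant):

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Training tokens suggested by the ~20 tokens/parameter rule of thumb."""
    return tokens_per_param * n_params

for n in [1e9, 7e9, 70e9, 175e9]:
    print(f"{n:.0e} params -> {chinchilla_tokens(n):.1e} training tokens")

# 70e9 params -> 1.4e12 tokens, which matches Chinchilla's actual training run.
# 175e9 params -> 3.5e12 tokens, far beyond the ~300e9 tokens GPT-3 saw --
# which is why the paper called models of that size under-trained.
```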

Another important relationship explored by the DeepMind researchers was how, given a fixed compute budget in FLOPs, an increase in that budget should be apportioned between model size and dataset size. The Kaplan scaling laws stated that if the compute budget is increased 10 times, model size and dataset size should be increased about 5 times and 2 times respectively. In contrast, the Chinchilla paper concluded that model size and training tokens should be scaled equally, implying that a 10 times increase in budget calls for a square root of 10 times increase in model size and a square root of 10 times increase in training tokens.
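
Combining C ≈ 6·N·D with a fixed tokens-per-parameter ratio r gives a closed form for the compute-optimal configuration: C = 6·r·N², so N = sqrt(C / 6r). A minimal sketch, assuming r = 20; the exponents actually fitted in the paper are close to, but not exactly, 0.5:

```python
import math

def compute_optimal(c_flops, tokens_per_param=20):
    """Solve C = 6 * N * D with D = r * N for the compute-optimal N and D."""
    n = math.sqrt(c_flops / (6 * tokens_per_param))
    return n, tokens_per_param * n

for c in [1e21, 1e23, 1e25]:
    n, d = compute_optimal(c)
    print(f"C = {c:.0e} FLOPs -> N = {n:.2e} params, D = {d:.2e} tokens")

# Multiplying C by 100 multiplies both N and D by 10 (= sqrt(100)):
# model size and data grow in tandem with the square root of the budget.
```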

Comparing the insights

The Kaplan scaling laws focused on sample efficiency - "Large models are more sample-efficient than small models, reaching the same level of performance with fewer optimization steps and using fewer data points." The Chinchilla paper, on the other hand, suggested that it is better to take a smaller model and train it on about 20 training tokens per parameter. Based on this premise, the DeepMind team took the 280B Gopher model's compute budget and trained a smaller 70B model on roughly four times more data. This model, named Chinchilla, outperformed Gopher and all the bigger models trained according to Kaplan's laws.
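
This reallocation can be checked with the 6·N·D approximation, using the published figures for the two models (280B parameters on 300B tokens for Gopher, 70B parameters on 1.4T tokens for Chinchilla); the two budgets come out roughly equal:

```python
# Same approximate compute budget, allocated two different ways (C ~ 6*N*D).
gopher     = 6 * 280e9 * 300e9    # 280B params on 300B tokens
chinchilla = 6 * 70e9 * 1.4e12    # 70B params (4x smaller) on ~4.7x more tokens

print(f"Gopher:     {gopher:.2e} FLOPs")      # ~5.0e+23
print(f"Chinchilla: {chinchilla:.2e} FLOPs")  # ~5.9e+23 -- roughly the same
```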

The two papers imply that we can either take a bigger model and train it on fewer tokens, with a focus on sample efficiency, or take a smaller model and train it on about 20 tokens per parameter. Both give the same performance, but they have different implications for compute infrastructure: models built along the Kaplan scaling laws require greater inference infrastructure, while Chinchilla "uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage".

Is the Chinchilla optimal regime appropriate?

Subsequent studies have pushed the boundaries of total training tokens while keeping model size constant, going past the optimal range proposed by the Chinchilla paper. Meta's research revealed that models trained with three times more tokens than the Chinchilla optimal range showed improved performance, casting doubt on the generality of Chinchilla's findings. The Llama 2 paper posited that there appears to be no saturation point even with continued increases in training tokens.

A Chinchilla optimal model typically requires 15 to 25 training tokens per parameter. Meta's Llama 3 70B model was trained with approximately 200 tokens per parameter, roughly 10 times the Chinchilla benchmark. Consequently, Llama 3 70B would theoretically require around three times the training compute, yet only about 27% of the parameters (making inference approximately 3.7 times cheaper) to achieve comparable loss. While these figures are approximations, they are close estimates. Alternatively, one could view it as performing like a Chinchilla optimal model of approximately 260 billion parameters, which helps explain its exceptional performance on benchmarks.
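
The parameter and inference figures above follow directly from the 260 billion Chinchilla-equivalent estimate; a sketch of the arithmetic, using only the approximate numbers quoted in this section:

```python
llama3_params     = 70e9
equivalent_params = 260e9   # Chinchilla-optimal model with comparable loss (see above)

print(f"Parameter fraction: {llama3_params / equivalent_params:.0%}")          # ~27%
print(f"Inference cost advantage: {equivalent_params / llama3_params:.1f}x")   # ~3.7x
```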

Microsoft's Phi-3 model pushes the boundary further, training with 870 tokens per parameter, approximately 45 times the Chinchilla benchmark. Llama 3 8B goes a step further still, using about 75 times the Chinchilla tokens per parameter. Although these models require roughly 10 to 15 times more training compute, they use only about 20% of the parameters (thus delivering around 5 times better inference performance) compared to an equivalent Chinchilla optimal model. According to the scaling laws, further gains beyond these thresholds may be limited, as parameter counts drop below 20% of the Chinchilla-equivalent size, suggesting an asymptotic trend. The technical report for Phi-3 3.8B also hints at the challenges encountered near this parameter budget, indicating that this limit might persist unless there are architectural changes to the models.
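
Putting these figures side by side (a sketch based on the per-parameter token counts quoted in this article, not independently verified training-run statistics):

```python
# Tokens-per-parameter figures quoted above, relative to the ~20 tokens/param
# Chinchilla rule of thumb.
models = {
    "Chinchilla 70B": 20,     # compute-optimal baseline
    "Llama 3 70B":    200,    # ~10x Chinchilla
    "Phi-3 3.8B":     870,    # ~45x Chinchilla
    "Llama 3 8B":     1500,   # ~75x Chinchilla
}

for name, tpp in models.items():
    print(f"{name:<14} {tpp:>5} tokens/param  ({tpp / 20:>3.0f}x Chinchilla)")
```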

Tiny Llama Experiment

Since Chinchilla's laws no longer appeared binding, the TinyLlama experiment tried to compress the maximum learning into a small model. It took a 1B parameter model and trained it on 3 trillion tokens (far beyond the 20 tokens per parameter recommended by Chinchilla as optimal). However, the model could not perform as well as either Chinchilla or the GPT models. In hindsight this is not surprising: reasoning abilities require multi-layer language representations to support emergent abilities, and one billion parameters are not sufficient to pack in all the learning required.
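
In tokens-per-parameter terms, that run sits far beyond every model discussed above; a quick calculation from the quoted figures:

```python
params = 1e9    # TinyLlama parameter count, as quoted above
tokens = 3e12   # 3 trillion training tokens

tpp = tokens / params
print(f"{tpp:.0f} tokens/param = {tpp / 20:.0f}x the Chinchilla optimum")
# 3000 tokens/param, i.e. 150x the ~20 tokens/param rule of thumb.
```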

Optimal Model

If our objective is good reasoning ability, even if the model size and internal knowledge are only just sufficient, what should the model size be? Such a model can then be trained on an optimal dataset size and number of training tokens to achieve optimal performance. Researchers believe that such a model should score at least 60% on the MMLU benchmark: a 60% MMLU score was found to imply that the model exhibits emergent abilities and can perform well as a reasoning engine.

Even the early large models, GPT-3 and Gopher, could only achieve MMLU scores of around 60%, yet the much smaller Chinchilla achieved an equivalent score. Now that the Chinchilla laws also appear not to be binding, can we achieve 60% MMLU with much smaller models that can run on a laptop? This was the mission with which Mistral AI was founded, and it led to Mistral 7B, which has an MMLU score comparable to much bigger models.

Implications for the evolution of AI Models

At present, the evolution of AI research can be grouped into three streams: the first, driven by business needs, aims at LLM workflow architectures that use agents and tools; the second is aimed at the possible evolution of autonomous agents of the kind made famous by the Terminator movies (Skynet); and the third is the quest for Artificial General Intelligence (AGI), which could supposedly solve problems faced by humanity by assisting in research.

While the AGI quest seems to drive bigger and bigger models, business use cases might lead to smaller models of the type of Llama 3 8B and Mistral 7B, which may serve simply as reasoning engines. Models intended to power autonomous agents might sit somewhere in between on this spectrum.

Conclusions

Researchers have found that scaling LLM based Generative AI models along model size, dataset size and compute budget can lead to tremendous improvements in performance, even without any changes in the model architecture. Initial research by OpenAI concluded that there could be diminishing returns beyond a point as model size and dataset size are increased in tandem. This led to an optimal relationship between dataset size and model size that resulted in huge, inflated GPT models. Subsequent research presented in the Chinchilla paper found that smaller models can be trained on 20 times more tokens to achieve the same performance as larger models trained on fewer tokens.

The initial research did not pay much attention to compute infrastructure or training tokens. By changing the learning rate schedule and increasing the number of training steps, the Chinchilla researchers found that the optimal region lies far beyond the levels indicated by the Kaplan scaling laws. Subsequent research by Meta then showed that even training with 200 tokens per parameter did not reach a saturation point. While experiments like TinyLlama failed in trying to compress all the required learning into 1B parameters, models like Microsoft's Phi have tested the boundaries of saturation.

The quest for smaller models that can fit on a laptop and perform as reasoning engines has succeeded with models like Llama 3 8B, Mistral 7B and the Phi models. However, the quest for AGI and for models that can support autonomous agents still seems to require bigger models and bigger datasets.

We might have reached the limits of the benefits of the transformer based architecture, which can readily learn and store multi-dimensional knowledge representations. The maximum size of the dataset is also limited, with the entire information content of the internet already being used for training. It is doubtful whether AGI is reachable by simply scaling the three factors. I think AGI might require further work on top of the best performing foundation models, with specially created synthetic datasets.

Having extracted most of the entropy of existing internet data, researchers might in future have to craft synthetic datasets, carefully and manually designed for specific AI tasks (working on top of foundation models), to reach further improvements in model performance. Architectural improvements could also help in significant measure.
