Scaling laws
A scaling law in deep learning typically takes the form of a power-law relationship, where one variable (e.g., model performance) is proportional to another variable (e.g., model size) raised to a certain power. This can be represented as: Y = K * X^a, where Y is the dependent variable, K is a constant, X is the independent variable, and a is the scaling exponent.
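As a quick illustration (with made-up numbers, not taken from any paper), a power law of this form becomes a straight line in log-log space, so K and a can be recovered with a simple linear fit:

```python
import numpy as np

# Hypothetical measurements: performance vs. model size (illustrative values only).
X = np.array([1e6, 1e7, 1e8, 1e9, 1e10])   # model size (parameters)
Y = 5.0 * X ** -0.07                        # pretend observations following Y = K * X^a

# Y = K * X^a  =>  log Y = log K + a * log X,
# so a straight-line fit in log-log space recovers a and K.
a, logK = np.polyfit(np.log(X), np.log(Y), deg=1)
print(f"estimated exponent a ≈ {a:.3f}, constant K ≈ {np.exp(logK):.3f}")
```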
OpenAI
When the OpenAI team led by Jared Kaplan released the paper "Scaling Laws for Neural Language Models", they focused on how model performance improves as model size and dataset size increase. A key goal was estimating the compute required to train these models.
Google joins the party
This was the first large empirical study of how to scale these LLMs. It was followed, however, by a pathbreaking paper from Google's DeepMind, "Training Compute-Optimal Large Language Models".
This is an excellent paper which showed that GPT-3 was trained sub-optimally: much smaller models, trained on more data, can reach the same quality. The result is popularly referred to as Chinchilla optimality, after the much smaller Chinchilla model they trained to demonstrate it.
They suggest a more integrated approach, combining model size and data size in a single equation: L(N, D) = E + A/N^α + B/D^β, where E = 1.69, A = 406.4, B = 410.7, α = 0.34, and β = 0.28. These values were estimated empirically by fitting to training runs across a range of model and dataset sizes.
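To make this concrete, here is a minimal sketch that plugs the reported coefficients into the equation. The two (N, D) configurations are illustrative: a GPT-3-like run (175B parameters, ~300B tokens) versus a smaller model trained on more tokens at roughly the same compute under the common C ≈ 6ND approximation.

```python
# Chinchilla parametric loss: L(N, D) = E + A / N**alpha + B / D**beta
# Coefficients as reported in "Training Compute-Optimal Large Language Models".
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(N: float, D: float) -> float:
    """Predicted pre-training loss for N parameters trained on D tokens."""
    return E + A / N**alpha + B / D**beta

# Two configurations with (roughly) the same compute budget under C ≈ 6 * N * D:
configs = [
    ("175B params / 300B tokens (GPT-3-like)", 175e9, 300e9),
    ("70B params / 750B tokens (smaller, longer-trained)", 70e9, 750e9),
]

for name, N, D in configs:
    print(f"{name}: C ≈ {6 * N * D:.2e} FLOPs, predicted loss ≈ {chinchilla_loss(N, D):.3f}")
```

Under these fitted coefficients, the smaller, longer-trained configuration is predicted to reach a lower loss for the same compute, which is the essence of the Chinchilla result.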
Interestingly, one of the authors of this paper, Arthur Mensch, went on to co-found Mistral AI, which continues to show that smaller models can make a bigger impact.
Next steps
A new paper from DeepSeek, "DeepSeek LLM: Scaling Open-Source Language Models with Longtermism", improves upon the Chinchilla paper. First, they conducted a grid search for batch size and learning rate in small-scale experiments with a compute budget of 1e17 FLOPs; the paper reports the results for a representative model size (177M FLOPs/token).
In earlier work, the relationship between compute budget C and model/data scale could be approximately described as C = 6ND, where D is the data scale (number of tokens) and N is the model scale (number of parameters). N is typically computed either from non-embedding parameters N1 (OpenAI) or complete parameters N2 (Chinchilla).
Here they have introduced a new model scale representation: non-embedding FLOPs/token M. M includes the computational overhead of the attention operation but does not take into account the vocabulary computation. With the model scale represented by M, the compute budget C can be simply expressed as C = MD.
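To give a feel for what a FLOPs/token count like M looks like, here is a rough sketch. The formula below is an assumption based on the standard ~6 training FLOPs per non-embedding parameter per token plus an attention term, not the paper's exact accounting, and the example architecture is hypothetical.

```python
def non_embedding_flops_per_token(n_layers: int, d_model: int, seq_len: int) -> float:
    """Rough estimate of training FLOPs per token, M (an assumption, not the paper's exact code).

    ~12 * n_layers * d_model**2 non-embedding parameters -> ~6x that in training FLOPs,
    plus an attention term that grows with sequence length; vocabulary/embedding
    computation is deliberately excluded, as in the DeepSeek formulation.
    """
    dense_part = 72 * n_layers * d_model**2           # 6 FLOPs/param * 12 * n_layers * d_model^2
    attention_part = 12 * n_layers * d_model * seq_len
    return dense_part + attention_part

# Example: a hypothetical small transformer configuration.
M = non_embedding_flops_per_token(n_layers=24, d_model=2048, seq_len=4096)
D = 100e9                 # tokens
C = M * D                 # compute budget under C = M * D
print(f"M ≈ {M:.3e} FLOPs/token, C ≈ {C:.3e} FLOPs for {D:.0e} tokens")
```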
Given a compute budget C = MD, find the optimal model scale M_opt and data scale D_opt that minimize the generalization error of the model.
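The paper's own fitted scaling curves are not reproduced here, so the sketch below runs the same kind of optimization with the Chinchilla coefficients from above as a stand-in: fix a compute budget C, sweep the model scale (with M ≈ 6N so that D = C/M), and take the point with the lowest predicted loss.

```python
import numpy as np

# Chinchilla-style parametric loss, reused as a stand-in for a fitted L(model scale, data scale).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(N: np.ndarray, D: np.ndarray) -> np.ndarray:
    return E + A / N**alpha + B / D**beta

C = 3.15e23                    # fixed compute budget (≈ GPT-3 scale under C = 6ND)
N = np.logspace(9, 12, 2000)   # candidate model scales: 1B .. 1T parameters
M = 6 * N                      # approximate non-embedding FLOPs/token, M ≈ 6N
D = C / M                      # data scale implied by the budget, D = C / M

losses = predicted_loss(N, D)
best = np.argmin(losses)
print(f"N_opt ≈ {N[best]:.2e} params, D_opt ≈ {D[best]:.2e} tokens, "
      f"predicted loss ≈ {losses[best]:.3f}")
```

Under these stand-in coefficients, the minimum for this budget lands at a model well under 175B parameters trained on far more tokens than GPT-3, which is exactly the kind of allocation the compute-optimal framing pushes towards.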
Using this formulation, they trained two models: DeepSeek LLM 7B and 67B (GitHub - deepseek-ai/DeepSeek-LLM: DeepSeek LLM: Let there be answers).
So we now have an update to the scaling laws. But the central point about data quality remains: it seems to be the most important factor for LLM quality, not just the number of tokens.