Scaling laws
Taken from https://arxiv.org/pdf/2001.08361.pdf

A scaling law in deep learning typically takes the form of a power-law relationship, where one variable (e.g., model performance) is proportional to another variable (e.g., model size) raised to a certain power. This can be represented as: Y = K * X^a, where Y is the dependent variable, K is a constant, X is the independent variable, and a is the scaling exponent.
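As a quick illustration, here is a minimal sketch of that power law evaluated in Python. The constants K and a below are made up for illustration; one useful property of a power law is that it is a straight line on a log-log scale, with slope equal to the exponent a:

```python
import math

def power_law(x, k, a):
    """Evaluate Y = K * X^a."""
    return k * x ** a

# Hypothetical constants, for illustration only.
K, a = 2.0, 0.5
print(power_law(4.0, K, a))  # 2.0 * 4^0.5 = 4.0

# On a log-log scale the relationship is linear with slope a:
x1, x2 = 10.0, 1000.0
slope = (math.log(power_law(x2, K, a)) - math.log(power_law(x1, K, a))) / (
    math.log(x2) - math.log(x1)
)
print(round(slope, 6))  # recovers the exponent a = 0.5
```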

OpenAI

When the OpenAI team led by Kaplan released the paper "Scaling Laws for Neural Language Models", they focused on how model performance improves as model size and dataset size increase. The emphasis was on estimating the compute required to train these models.

Google joins the party

This was the first empirical study of how to scale these LLMs. It was followed by a pathbreaking paper from Google DeepMind, "Training Compute-Optimal Large Language Models".

This excellent paper showed that GPT-3 was trained sub-optimally and that much smaller models can reach the same quality. The result is popularly referred to as Chinchilla optimality: they trained a much smaller model and demonstrated the same quality.

They suggest a more integrated approach, combining model size and data size in a single equation: L(N, D) = E + A/N^α + B/D^β, where E = 1.69, A = 406.4, B = 410.7, α = 0.34, and β = 0.28. These values were empirically estimated by training across many model and dataset sizes.
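The fitted equation can be evaluated directly. A minimal sketch, plugging in roughly Chinchilla's own scale (about 70B parameters and 1.4T tokens) as an illustrative input:

```python
def chinchilla_loss(n_params, n_tokens):
    """Chinchilla parametric loss: L(N, D) = E + A/N^alpha + B/D^beta,
    using the constants fitted in the paper."""
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# Chinchilla-scale model: ~70B parameters trained on ~1.4T tokens.
print(round(chinchilla_loss(70e9, 1.4e12), 3))  # ≈ 1.94
```

Note how E = 1.69 acts as an irreducible-loss floor: as N and D both grow, the two correction terms shrink toward zero and L approaches E.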

Interestingly, a key author of this paper, Arthur Mensch, is a co-founder of Mistral AI, which is showing that smaller models can make a bigger impact.

Next steps

A new paper from DeepSeek, "DeepSeek LLM: Scaling Open-Source Language Models with Longtermism", improves upon the Chinchilla paper. First they conducted a grid search for batch size and learning rate in small-scale experiments with a compute budget of 1e17 FLOPs; results for a specific model scale (177M FLOPs/token) are reported in the paper.

In earlier work, the relationship between compute budget C and model/data scale could be approximately described as C = 6ND, where D is the data scale (number of tokens) and N is the number of parameters. Typically, N is computed either as non-embedding parameters N1 (OpenAI) or complete parameters N2 (Chinchilla).

Here they introduce a new model scale representation: non-embedding FLOPs/token, M. M includes the computational overhead of the attention operation but does not account for the vocabulary computation. With the model scale represented by M, the compute budget C can be simply expressed as C = MD.

Given a compute budget C = MD, find the optimal model scale Mopt and data scale Dopt that minimize the generalization error of the model.
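The relation C = MD fixes the token budget once a model scale is chosen. A minimal sketch using the numbers quoted above (the 1e17 FLOPs budget and the 177M FLOPs/token model scale from DeepSeek's small-scale experiments):

```python
def tokens_for_budget(compute_budget, flops_per_token):
    """Given C = M * D, solve for the data scale D (tokens)."""
    return compute_budget / flops_per_token

C = 1e17    # compute budget (FLOPs) from the small-scale experiments
M = 177e6   # model scale: non-embedding FLOPs/token

D = tokens_for_budget(C, M)
print(f"{D:.3e} tokens")  # ≈ 5.650e+08 tokens

# For comparison, under the older approximation C = 6 * N * D,
# the same model scale corresponds to M ≈ 6N, i.e. ignoring the
# extra attention overhead that M accounts for:
N = M / 6
print(f"{N:.3e} parameters")
```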

Using this formulation they arrive at two different models.

They have trained DeepSeek LLM 7B and 67B (see the deepseek-ai/DeepSeek-LLM repository on GitHub: "DeepSeek LLM: Let there be answers").

So now we have an update on the scaling laws. But the central question of data quality remains: it seems to be the most important factor for LLM quality, not the number of tokens.


