Scaling laws
A scaling law in deep learning typically takes the form of a power-law relationship, where one variable (e.g., model performance) is proportional to another variable (e.g., model size) raised to a certain power. This can be represented as: Y = K * X^a, where Y is the dependent variable, K is a constant, X is the independent variable, and a is the scaling exponent.
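As a quick illustration (with made-up numbers, not taken from any paper), a power law of this form becomes a straight line in log-log space, so K and a can be recovered with a simple linear fit:

```python
import numpy as np

# Hypothetical measurements: performance vs. model size (illustrative values only).
X = np.array([1e6, 1e7, 1e8, 1e9, 1e10])   # model size (parameters)
Y = 5.0 * X ** -0.07                        # pretend observations following Y = K * X^a

# Y = K * X^a  =>  log Y = log K + a * log X,
# so a straight-line fit in log-log space recovers a and K.
a, logK = np.polyfit(np.log(X), np.log(Y), deg=1)
print(f"estimated exponent a ≈ {a:.3f}, constant K ≈ {np.exp(logK):.3f}")
```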
OpenAI
When the OpenAI team led by Jared Kaplan released the paper "Scaling Laws for Neural Language Models", they focused on how model performance improves as model size and dataset size increase. A key goal was estimating the compute required to train these models.
Google joins the party
This was the first large empirical study of how to scale these LLMs. It was followed, however, by a pathbreaking paper from Google's DeepMind, "Training Compute-Optimal Large Language Models".
This is an excellent paper which showed that GPT-3 was trained sub-optimally: much smaller models, trained on more data, can reach the same quality. The result is popularly referred to as Chinchilla optimality, after the much smaller Chinchilla model they trained to demonstrate it.
They suggest a more integrated approach, combining model size and data size in a single equation: L(N, D) = E + A/N^α + B/D^β, where E = 1.69, A = 406.4, B = 410.7, α = 0.34, and β = 0.28. These values were estimated empirically by fitting to training runs across a range of model and dataset sizes.
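To make this concrete, here is a minimal sketch that plugs the reported coefficients into the equation. The two (N, D) configurations are illustrative: a GPT-3-like run (175B parameters, ~300B tokens) versus a smaller model trained on more tokens at roughly the same compute under the common C ≈ 6ND approximation.

```python
# Chinchilla parametric loss: L(N, D) = E + A / N**alpha + B / D**beta
# Coefficients as reported in "Training Compute-Optimal Large Language Models".
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(N: float, D: float) -> float:
    """Predicted pre-training loss for N parameters trained on D tokens."""
    return E + A / N**alpha + B / D**beta

# Two configurations with (roughly) the same compute budget under C ≈ 6 * N * D:
configs = [
    ("175B params / 300B tokens (GPT-3-like)", 175e9, 300e9),
    ("70B params / 750B tokens (smaller, longer-trained)", 70e9, 750e9),
]

for name, N, D in configs:
    print(f"{name}: C ≈ {6 * N * D:.2e} FLOPs, predicted loss ≈ {chinchilla_loss(N, D):.3f}")
```

Under these fitted coefficients, the smaller, longer-trained configuration is predicted to reach a lower loss for the same compute, which is the essence of the Chinchilla result.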
Interestingly, one of the authors of this paper, Arthur Mensch, went on to co-found Mistral AI, which continues to show that smaller models can make a bigger impact.
Next steps
A new paper from DeepSeek, "DeepSeek LLM: Scaling Open-Source Language Models with Longtermism", improves upon the Chinchilla paper. First, they conducted a grid search for batch size and learning rate in small-scale experiments with a compute budget of 1e17 FLOPs; the paper reports the results for a representative model size (177M FLOPs/token).
In earlier work, the relationship between compute budget C and model/data scale could be approximately described as C = 6ND, where D is the data scale (number of tokens) and N is the model scale (number of parameters). N is typically computed either from non-embedding parameters N1 (OpenAI) or complete parameters N2 (Chinchilla).
Here they have introduced a new model scale representation: non-embedding FLOPs/token M. M includes the computational overhead of the attention operation but does not take into account the vocabulary computation. With the model scale represented by M, the compute budget C can be simply expressed as C = MD.
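To give a feel for what a FLOPs/token count like M looks like, here is a rough sketch. The formula below is an assumption based on the standard ~6 training FLOPs per non-embedding parameter per token plus an attention term, not the paper's exact accounting, and the example architecture is hypothetical.

```python
def non_embedding_flops_per_token(n_layers: int, d_model: int, seq_len: int) -> float:
    """Rough estimate of training FLOPs per token, M (an assumption, not the paper's exact code).

    ~12 * n_layers * d_model**2 non-embedding parameters -> ~6x that in training FLOPs,
    plus an attention term that grows with sequence length; vocabulary/embedding
    computation is deliberately excluded, as in the DeepSeek formulation.
    """
    dense_part = 72 * n_layers * d_model**2           # 6 FLOPs/param * 12 * n_layers * d_model^2
    attention_part = 12 * n_layers * d_model * seq_len
    return dense_part + attention_part

# Example: a hypothetical small transformer configuration.
M = non_embedding_flops_per_token(n_layers=24, d_model=2048, seq_len=4096)
D = 100e9                 # tokens
C = M * D                 # compute budget under C = M * D
print(f"M ≈ {M:.3e} FLOPs/token, C ≈ {C:.3e} FLOPs for {D:.0e} tokens")
```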
Given a compute budget C = MD, find the optimal model scale M_opt and data scale D_opt that minimize the generalization error of the model.
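The paper's own fitted scaling curves are not reproduced here, so the sketch below runs the same kind of optimization with the Chinchilla coefficients from above as a stand-in: fix a compute budget C, sweep the model scale (with M ≈ 6N so that D = C/M), and take the point with the lowest predicted loss.

```python
import numpy as np

# Chinchilla-style parametric loss, reused as a stand-in for a fitted L(model scale, data scale).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(N: np.ndarray, D: np.ndarray) -> np.ndarray:
    return E + A / N**alpha + B / D**beta

C = 3.15e23                    # fixed compute budget (≈ GPT-3 scale under C = 6ND)
N = np.logspace(9, 12, 2000)   # candidate model scales: 1B .. 1T parameters
M = 6 * N                      # approximate non-embedding FLOPs/token, M ≈ 6N
D = C / M                      # data scale implied by the budget, D = C / M

losses = predicted_loss(N, D)
best = np.argmin(losses)
print(f"N_opt ≈ {N[best]:.2e} params, D_opt ≈ {D[best]:.2e} tokens, "
      f"predicted loss ≈ {losses[best]:.3f}")
```

Under these stand-in coefficients, the minimum for this budget lands at a model well under 175B parameters trained on far more tokens than GPT-3, which is exactly the kind of allocation the compute-optimal framing pushes towards.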
Using this formulation, they trained two models: DeepSeek LLM 7B and 67B (GitHub - deepseek-ai/DeepSeek-LLM: DeepSeek LLM: Let there be answers).
So we now have an update to the scaling laws. But the central point about data quality remains: it seems to be the most important factor for LLM quality, not just the number of tokens.