Building The World’s Greatest Recommender System Part 14: Training Distributed Machine Learning Models
Machine learning has seemingly adopted the attitude that “bigger is better”. For example, Large Language Models (LLMs) such as GPT-4 and Llama 3 are many times (on the order of 10x) larger than their predecessors. Recommender system models also do not appear to be shrinking. This is for good reason: larger models are more expressive.
However, as models become ever larger, we encounter inherent challenges. Fundamentally, the largest models cannot fit on a single processor (GPU), and resolving this issue is critical to training them.
So how do we train a model that cannot fit on a single GPU?
Well, intuitively, we would look to break the model up across multiple GPUs: this is known as sharding.
Given that a model is composed of layers, generally linear layers (Wx + b) wrapped in a non-linear “activation” function (such as ReLU, Swish, or GELU), a logical approach would be to break the model up by layers.
This approach of breaking up the model by layers, putting each layer onto a different GPU, is also known as Vertical Model Parallel. This is a fairly common approach, and as a result it is supported by popular open source machine learning frameworks like PyTorch. To place a layer on GPU n, using PyTorch with NVIDIA’s GPUs, we only need to add “.cuda(n)” to the end of the layer instantiation. For example, if we had GPU 0 and GPU 1, we could put a linear layer with a 16-dimensional input (16 inputs) and an 8-dimensional output (8 outputs) on GPU 0 with “nn.Linear(16, 8).cuda(0)”. After instantiating all of the layers on their respective GPUs, we would wrap them in “nn.Sequential” to declare their order; the one remaining detail for Vertical Model Parallel is that each layer’s output must be moved to the next layer’s GPU during the forward pass.
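To make this concrete, here is a minimal sketch of a two-stage split in PyTorch. The layer sizes, the ReLU activation, and the two-GPU assignment are illustrative assumptions, not a prescription:

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Vertical Model Parallel sketch: early layers on GPU 0, later layers on GPU 1."""
    def __init__(self):
        super().__init__()
        # First stage of the model lives on GPU 0.
        self.stage0 = nn.Sequential(nn.Linear(16, 8), nn.ReLU()).cuda(0)
        # Second stage of the model lives on GPU 1.
        self.stage1 = nn.Linear(8, 1).cuda(1)

    def forward(self, x):
        x = self.stage0(x.cuda(0))  # run the first layers on GPU 0
        x = x.cuda(1)               # hand the activation off to GPU 1
        return self.stage1(x)       # run the remaining layers on GPU 1
```

The key line is the x.cuda(1) hand-off: that is what stitches the two GPUs together into one model.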
However, we can actually further optimize Vertical Model Parallel. If we examine the model training process, we see that it consists of two steps: a forward pass and a backward pass (backpropagation). The forward pass generates a prediction, and a loss function computes the error of that prediction relative to the actual value. The loss is then used to compute the gradient (the direction of greatest increase in loss), and the negative of the gradient (the direction of greatest decrease in loss) is used during backpropagation to update the weights so that the loss decreases. Thus, with Vertical Model Parallel, the layers of the model, each on a different GPU, perform the forward pass on the same data sequentially, with later layers waiting on earlier ones. After the forward pass, the layers perform backpropagation sequentially, with earlier layers waiting on later ones.
(Figure: F_n represents the forward pass through layer n, and B_n represents the backward pass through layer n.)
As a consequence, training progresses sequentially, with only one GPU doing work at any given time.
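Continuing the sketch above (with an illustrative mean-squared-error loss and SGD optimizer, both my own assumptions), a single training step looks like this. Each stage waits for the previous one: GPU 0 then GPU 1 on the way forward, GPU 1 then GPU 0 on the way back:

```python
import torch

model = TwoGPUModel()  # the hypothetical two-stage model sketched earlier
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

x = torch.randn(32, 16)  # one batch of 32 examples with 16 features
y = torch.randn(32, 1)   # the corresponding targets

pred = model(x)                   # forward pass: GPU 0, then GPU 1
loss = loss_fn(pred, y.cuda(1))   # compute the loss on GPU 1, where the output lives
loss.backward()                   # backward pass: GPU 1, then GPU 0
optimizer.step()                  # step the weights in the negative gradient direction
optimizer.zero_grad()
```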
How do we enable Vertical Model Parallel to utilize parallel, rather than sequential, execution?
We implement Pipeline Parallel. This involves splitting the data into multiple smaller batches (often called micro-batches) and passing each one through the earliest layer of the model (layer 0). As soon as one batch is done passing through the earliest layer, its output can be passed to the next layer, while a new batch is being passed through the earliest layer of the model.
For example, if we have 4 layers on 4 GPUs, we can have layer 0 (the first layer) working on batch 3, layer 1 working on batch 2, layer 2 working on batch 1, and layer 3 working on batch 0 (the first batch), all concurrently.
We can apply the same approach to the backward pass, but starting from the last layer instead.
(Figure: F_n,m represents the forward pass through layer n with batch m, and B_n,m represents the backward pass through layer n with batch m.)
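Here is a minimal sketch of the micro-batching idea, reusing the hypothetical two-stage model from earlier with an illustrative chunk count of 4: one batch is split into micro-batches and the gradients are accumulated across them. A naive loop like this captures the bookkeeping; a real pipeline framework (for example PyTorch’s pipeline parallelism utilities or DeepSpeed) additionally interleaves the stages according to a schedule like the one described above so that every GPU stays busy.

```python
import torch

model = TwoGPUModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

x = torch.randn(32, 16)
y = torch.randn(32, 1)

optimizer.zero_grad()
# Split the batch into 4 micro-batches; gradients accumulate across them,
# so the single optimizer step at the end reflects the whole batch.
for micro_x, micro_y in zip(x.chunk(4), y.chunk(4)):
    pred = model(micro_x)                      # forward pass for this micro-batch
    loss = loss_fn(pred, micro_y.cuda(1)) / 4  # average the loss over the 4 micro-batches
    loss.backward()                            # backward pass for this micro-batch
optimizer.step()                               # one weight update for the full batch
```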
Through Pipeline Parallel, we can run the forward and backward passes with parallelism across GPUs, share gradients for gradient descent (we’ll touch on this in greater depth later on), and perform model weight updates in parallel. Thus, the approach not only enables the training of models too large to fit on a single GPU, but also enables significant parallelization of training, allowing for more efficient utilization of the multiple GPUs on which the model resides. Consequently, through approaches such as Pipeline Parallel, we are able to train ever-larger, ever-more-expressive, and (hopefully) ever-improving machine learning models.
If you benefited from this post, please share so it can help others.