Building The World’s Greatest Recommender System Part 14: Training Distributed Machine Learning Models
Machine learning has seemingly adopted the attitude that “bigger is better”. For example, Large Language Models (LLMs) such as GPT-4 and Llama 3 are many times (on the order of 10x) larger than their predecessors. Recommender system models also do not appear to be shrinking. This is for good reason: larger models are more expressive.
However, as models become ever larger, we encounter inherent challenges. Fundamentally, the largest models cannot fit on a single processor (GPU), and resolving this issue is critical to training them.
So how do we train a model that cannot fit on a single GPU?
Well, intuitively, we would look to break the model up across multiple GPUs: this is known as sharding.
Given that a model is composed of layers, generally linear layers (Wx + b) wrapped in a non-linear “activation” function (such as ReLU, Swish, or GELU), a logical approach would be to break the model up by layers.
This approach of breaking up the model by layers, putting each layer onto a different GPU, is also known as Vertical Model Parallel. This is a fairly common approach, and as a result it is supported by popular open source machine learning frameworks like PyTorch. To place a layer on GPU n, using PyTorch with NVIDIA’s GPUs, we only need to add “.cuda(n)” to the end of the layer instantiation. For example, if we had GPU 0 and GPU 1, we could put a linear layer with a 16-dimensional input (16 inputs) and an 8-dimensional output (8 outputs) on GPU 0 with “nn.Linear(16, 8).cuda(0)”. After instantiating all of the layers on their respective GPUs, we would wrap them in “nn.Sequential” to declare their order; the one remaining detail for Vertical Model Parallel is that each layer’s output must be moved to the next layer’s GPU during the forward pass.
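To make this concrete, here is a minimal sketch of a two-stage split in PyTorch. The layer sizes, the ReLU activation, and the two-GPU assignment are illustrative assumptions, not a prescription:

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Vertical Model Parallel sketch: early layers on GPU 0, later layers on GPU 1."""
    def __init__(self):
        super().__init__()
        # First stage of the model lives on GPU 0.
        self.stage0 = nn.Sequential(nn.Linear(16, 8), nn.ReLU()).cuda(0)
        # Second stage of the model lives on GPU 1.
        self.stage1 = nn.Linear(8, 1).cuda(1)

    def forward(self, x):
        x = self.stage0(x.cuda(0))  # run the first layers on GPU 0
        x = x.cuda(1)               # hand the activation off to GPU 1
        return self.stage1(x)       # run the remaining layers on GPU 1
```

The key line is the x.cuda(1) hand-off: that is what stitches the two GPUs together into one model.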
However, we can actually further optimize Vertical Model Parallel. If we examine the model training process, we see that it consists of two steps: a forward pass and a backward pass (backpropagation). The forward pass generates a prediction, and a loss function computes the error of that prediction relative to the actual value. The loss is then used to compute the gradient (the direction of greatest increase in loss), and the negative of the gradient (the direction of greatest decrease in loss) is used during backpropagation to update the weights so that the loss decreases. Thus, with Vertical Model Parallel, the layers of the model, each on a different GPU, perform the forward pass on the same data sequentially, with later layers waiting on earlier ones. After the forward pass, the layers perform backpropagation sequentially, with earlier layers waiting on later ones.
(Figure: F_n represents the forward pass through layer n, and B_n represents the backward pass through layer n.)
As a consequence, training progresses sequentially, with only one GPU doing work at any given time.
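Continuing the sketch above (with an illustrative mean-squared-error loss and SGD optimizer, both my own assumptions), a single training step looks like this. Each stage waits for the previous one: GPU 0 then GPU 1 on the way forward, GPU 1 then GPU 0 on the way back:

```python
import torch

model = TwoGPUModel()  # the hypothetical two-stage model sketched earlier
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

x = torch.randn(32, 16)  # one batch of 32 examples with 16 features
y = torch.randn(32, 1)   # the corresponding targets

pred = model(x)                   # forward pass: GPU 0, then GPU 1
loss = loss_fn(pred, y.cuda(1))   # compute the loss on GPU 1, where the output lives
loss.backward()                   # backward pass: GPU 1, then GPU 0
optimizer.step()                  # step the weights in the negative gradient direction
optimizer.zero_grad()
```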
How do we enable Vertical Model Parallel to utilize parallel, rather than sequential, execution?
We implement Pipeline Parallel. This involves splitting the data into multiple smaller batches (often called micro-batches) and passing each one through the earliest layer of the model (layer 0). As soon as one batch is done passing through the earliest layer, its output can be passed to the next layer, while a new batch is being passed through the earliest layer of the model.
For example, if we have 4 layers on 4 GPUs, we can have layer 0 (the first layer) working on batch 3, layer 1 working on batch 2, layer 2 working on batch 1, and layer 3 working on batch 0 (the first batch), all concurrently.
We can apply the same approach to the backward pass, but starting from the last layer instead.
(Figure: F_n,m represents the forward pass through layer n with batch m, and B_n,m represents the backward pass through layer n with batch m.)
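Here is a minimal sketch of the micro-batching idea, reusing the hypothetical two-stage model from earlier with an illustrative chunk count of 4: one batch is split into micro-batches and the gradients are accumulated across them. A naive loop like this captures the bookkeeping; a real pipeline framework (for example PyTorch’s pipeline parallelism utilities or DeepSpeed) additionally interleaves the stages according to a schedule like the one described above so that every GPU stays busy.

```python
import torch

model = TwoGPUModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

x = torch.randn(32, 16)
y = torch.randn(32, 1)

optimizer.zero_grad()
# Split the batch into 4 micro-batches; gradients accumulate across them,
# so the single optimizer step at the end reflects the whole batch.
for micro_x, micro_y in zip(x.chunk(4), y.chunk(4)):
    pred = model(micro_x)                      # forward pass for this micro-batch
    loss = loss_fn(pred, micro_y.cuda(1)) / 4  # average the loss over the 4 micro-batches
    loss.backward()                            # backward pass for this micro-batch
optimizer.step()                               # one weight update for the full batch
```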
Through Pipeline Parallel, we can run the forward and backward passes with parallelism across GPUs, share gradients for gradient descent (we’ll touch on this in greater depth later on), and perform model weight updates in parallel. Thus, the approach not only enables the training of models too large to fit on a single GPU, but also enables significant parallelization of training, allowing for more efficient utilization of the multiple GPUs on which the model resides. Consequently, through approaches such as Pipeline Parallel, we are able to train ever-larger, ever-more-expressive, and (hopefully) ever-improving machine learning models.
If you benefited from this post, please share so it can help others.