Beyond Data and Model Parallelism: Sequence Parallelism with Scatter and Gather Patterns
As deep learning models continue to grow in size and complexity, efficiently training these massive networks has become a significant challenge. Traditional parallelism strategies like data parallelism and model parallelism have been instrumental but come with their own limitations. Enter sequence parallelism—a novel approach that addresses some of these constraints, offering a new avenue for optimizing large-scale model training.
The Traditional Approaches: Data and Model Parallelism
Data Parallelism involves splitting batches of input data across multiple GPUs. Each GPU works independently on its portion of the data using the same model parameters. After computation, gradients are aggregated to update the model synchronously. This method is relatively straightforward and scales well with the number of processors. However, because every GPU must hold a full replica of the model, it cannot accommodate models that exceed the memory capacity of a single GPU.
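As a rough illustration, here is a minimal data-parallel training sketch using PyTorch's DistributedDataParallel. The model, dataset, and hyperparameters are arbitrary placeholders, and it assumes one process per GPU (for example, launched with torchrun).

```python
# Minimal data-parallel sketch with PyTorch DDP.
# Placeholder model/dataset; assumes one process per GPU (e.g. launched via torchrun).
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def train():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)          # assumes global rank == local GPU index

    model = DDP(torch.nn.Linear(512, 512).cuda(rank), device_ids=[rank])
    dataset = TensorDataset(torch.randn(1024, 512), torch.randn(1024, 512))
    sampler = DistributedSampler(dataset)            # each rank sees a disjoint data shard
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for x, y in loader:
        x, y = x.cuda(rank), y.cuda(rank)
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()                  # DDP all-reduces (averages) gradients here
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()
```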
Model Parallelism, on the other hand, partitions the model itself across multiple GPUs. Different layers or components of the model are allocated to different processors. While this allows for training larger models, it introduces significant communication overhead between GPUs, potentially leading to inefficiencies and slower training times.
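A minimal sketch of this idea, assuming two visible GPUs: the first half of a toy network lives on cuda:0 and the second half on cuda:1, so activations must be copied between devices at the layer boundary.

```python
# Naive model-parallel sketch: layers split across two GPUs (assumes 2 visible devices).
import torch
import torch.nn as nn

class TwoDeviceModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 1024).to("cuda:1")

    def forward(self, x):
        h = self.part1(x.to("cuda:0"))
        # The device-to-device copy below is the inter-GPU communication
        # overhead that model parallelism introduces between layers.
        return self.part2(h.to("cuda:1"))

model = TwoDeviceModel()
out = model(torch.randn(8, 1024))        # output tensor lives on cuda:1
```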
Introducing Sequence Parallelism
Traditional strategies like data parallelism and model parallelism distribute workloads across multiple GPUs but often hit limits with very large models or long input sequences. Sequence parallelism takes a different angle: the input sequence itself is split across GPUs. Using the scatter and gather design patterns, it divides each input sequence among the available devices, enabling efficient training of large models such as transformers without exceeding per-GPU memory constraints.
How Sequence Parallelism Works
In sequence parallelism, an input sequence is divided into segments, each assigned to a different GPU. For instance, if you have a sequence of 1,000 tokens and four GPUs, each GPU processes 250 tokens. This approach keeps GPUs busy and optimizes training in the following ways:
Local Computations
Each GPU independently computes the embeddings and initial layers for its segment of the sequence. This ensures that all GPUs are actively processing data without waiting on others, maximizing parallel efficiency.
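A minimal sketch of this scatter step, assuming the sequence length divides evenly by the number of ranks and that every rank starts with the full token tensor; the helper name is illustrative rather than taken from any library.

```python
# Scatter the sequence dimension: each rank embeds only its own contiguous chunk
# (e.g. 250 of 1,000 tokens when running on 4 GPUs).
import torch
import torch.distributed as dist

def embed_local_chunk(token_ids: torch.Tensor, embedding: torch.nn.Embedding) -> torch.Tensor:
    """token_ids: [batch, seq_len], identical on every rank."""
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    chunk = token_ids.size(1) // world_size          # assumes seq_len % world_size == 0
    local_ids = token_ids[:, rank * chunk:(rank + 1) * chunk]
    return embedding(local_ids)                      # [batch, seq_len / P, hidden]
```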
Attention Mechanism with Communication
Transformers and similar models rely on attention mechanisms that require access to the entire sequence. To handle this:
Figure 2 of the DeepSpeed-Ulysses paper shows the core design. As in the standard transformer architecture, an input sequence of length N is partitioned across the P available devices. Each local N/P partition is projected into query (Q), key (K), and value (V) embeddings. The QKV embeddings are then gathered into global QKV tensors through highly optimized all-to-all collectives between the participating compute devices. Following the all-to-all collective, attention is computed per head in the form:
Output_context = softmax(QKᵀ / √d) V
After the attention computation, another all-to-all collective transforms the output context tensor back to the sequence-parallel N/P layout for the subsequent operators (MLP MatMul, layer norm, etc.) in the remaining modules of the transformer layer block.
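The sketch below illustrates this kind of all-to-all redistribution in PyTorch, switching a tensor from a sequence-parallel layout [N/P, H, d] (local tokens, all heads) to a head-parallel layout [N, H/P, d] (all tokens, local heads). It is a simplified illustration of the idea rather than the DeepSpeed implementation, and it assumes both N and the head count H divide evenly by P.

```python
# Sequence-parallel -> head-parallel redistribution via one all-to-all collective.
import torch
import torch.distributed as dist

def seq_to_head_all_to_all(x: torch.Tensor, group=None) -> torch.Tensor:
    """x: [N/P, H, d] local sequence chunk with all heads.
    Returns: [N, H/P, d] full sequence with this rank's subset of heads."""
    P = dist.get_world_size(group)
    n_local, H, d = x.shape
    # Group heads by destination rank and move that axis to the front:
    # [P, N/P, H/P, d]; chunk i along dim 0 is sent to rank i.
    x = x.reshape(n_local, P, H // P, d).permute(1, 0, 2, 3).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=group)
    # Each received chunk is another rank's sequence slice for our local heads;
    # concatenating them along the sequence axis restores the full sequence.
    return out.reshape(P * n_local, H // P, d)
```

The post-attention all-to-all mentioned above is the inverse of this transformation, returning the context tensor to the N/P sequence-parallel layout.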
Global Attention Computation
With the complete keys and values, each GPU computes attention scores for its queries against the entire sequence. This computation is intensive and fully utilizes the GPUs' capabilities. By employing the gather pattern, GPUs collect the necessary information to perform these computations, ensuring that each token can attend to every other token in the sequence.
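One hedged sketch of this gather-then-attend step, assuming Q, K, and V are already split along the sequence dimension: each rank gathers the keys and values from every other rank and then runs standard scaled dot-product attention with its local queries. The helper name is illustrative.

```python
# Gather pattern: local queries attend to keys/values from the entire sequence.
import torch
import torch.distributed as dist
import torch.nn.functional as F

def attend_to_full_sequence(q_local, k_local, v_local, group=None):
    """q_local, k_local, v_local: [batch, heads, N/P, d]."""
    P = dist.get_world_size(group)
    k_parts = [torch.empty_like(k_local) for _ in range(P)]
    v_parts = [torch.empty_like(v_local) for _ in range(P)]
    dist.all_gather(k_parts, k_local, group=group)    # collect K from all ranks
    dist.all_gather(v_parts, v_local, group=group)    # collect V from all ranks
    k_full = torch.cat(k_parts, dim=2)                # [batch, heads, N, d]
    v_full = torch.cat(v_parts, dim=2)
    # softmax(Q Kᵀ / √d) V with local Q against the whole sequence.
    return F.scaled_dot_product_attention(q_local, k_full, v_full)
```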
Updating Representations and Continuing Computation
GPUs update their token representations using the attention outputs and proceed to process subsequent layers like feed-forward networks. This continuous computation ensures GPUs remain occupied, maintaining high efficiency throughout the training process.
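For illustration only, the post-attention portion of a transformer block on the local shard might look like the following simplified sketch (residual connection, layer norm, and feed-forward MLP; the exact ordering varies by architecture):

```python
# Post-attention computation runs entirely on the local sequence shard.
import torch
import torch.nn as nn

hidden = 1024
norm = nn.LayerNorm(hidden)
mlp = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))

def transformer_block_tail(x_local: torch.Tensor, attn_out_local: torch.Tensor) -> torch.Tensor:
    """x_local, attn_out_local: [batch, N/P, hidden]; no communication required."""
    h = norm(x_local + attn_out_local)   # residual connection + normalization
    return h + mlp(h)                    # position-wise MLP is independent per token
```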
Synchronization Points
After certain layers, GPUs synchronize to maintain model consistency. Efficient communication protocols minimize idle time during these synchronization phases, ensuring that the overall training process remains streamlined.
Backward Pass and Gradient Sharing
During training, each GPU computes gradients for its segment. Necessary gradients are exchanged between GPUs to update shared model parameters, keeping all GPUs engaged in both computation and communication. This collaboration ensures that the model converges correctly while maximizing resource utilization.
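A minimal sketch of this gradient exchange, assuming every rank holds a full replica of the shared parameters and gradients are averaged with an all-reduce after the local backward pass:

```python
# Average gradients of shared parameters across ranks after loss.backward().
import torch
import torch.distributed as dist

def sync_shared_gradients(model: torch.nn.Module) -> None:
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)   # sum across all GPUs
            p.grad.div_(world_size)                         # convert sum to mean
```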
Efficiency and Scalability
By overlapping computation with communication, sequence parallelism maximizes GPU utilization. GPUs are either processing data or communicating essential information, significantly reducing idle times. This method allows for training larger models with longer sequences without exceeding individual GPU memory limits, effectively scaling deep learning models.
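One common way to achieve this overlap, sketched under the assumption that gradients have already been grouped into buckets (grad_buckets is an illustrative list of tensors), is to launch the collectives asynchronously and wait only when the results are actually needed:

```python
# Overlap communication with computation using non-blocking collectives.
import torch
import torch.distributed as dist

def reduce_buckets_async(grad_buckets):
    handles = []
    for bucket in grad_buckets:
        # async_op=True returns immediately with a work handle,
        # letting other GPU work (e.g. backward for earlier layers) proceed.
        handles.append(dist.all_reduce(bucket, async_op=True))
    for h in handles:
        h.wait()          # ensure communication finished before the optimizer step
```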
Real-World Implementations
Projects like NVIDIA's Megatron-LM and Microsoft's DeepSpeed have successfully implemented sequence parallelism using scatter and gather patterns. DeepSpeed-Ulysses, for example, relies on the all-to-all collectives described above to scale transformer training to very long sequences.
Benefits of Sequence Parallelism with Scatter and Gather
By employing the scatter and gather design patterns, sequence parallelism offers several benefits: it keeps per-GPU memory usage within limits, supports much longer input sequences, and maintains high GPU utilization by overlapping computation with communication.
Thanks to Bharat Singh for explaining these concepts and providing valuable insights into sequence parallelism and the use of scatter and gather design patterns.