How Are Big Deep Learning Models Trained? A Book Review

During my #DataScience journey, I have noticed that demand for hardware such as GPUs, TPUs, and VPUs keeps rising because of the enormous compute consumed by models like GPT and BERT. Training on a large dataset with ordinary hardware is not easy, and even with powerful accelerators such as GPUs it is not always stress-free.

So today I bring you a book that explains how these giant models are trained, using techniques widely practised by the best data scientists in the industry and by companies such as Amazon, Microsoft, Google, Nvidia, and OpenAI.

Book: Distributed Machine Learning with Python by Guanhua Wang

Special thanks to Shifa Ansari for providing me with this amazing review copy.

Why should you care about Distributed Machine Learning?

---------------------------------------------------------------------

👉 It uses a multi-node ML system that improves performance, increases accuracy, and scales to large input data sizes.

👉 It minimizes errors made by machines and allows individuals to make informed decisions and analyses from large amounts of data.

How does this help Data Scientists improve their productivity?

---------------------------------------------------------------------

👉 Used to run multiple experiments in parallel on multiple devices (GPUs/TPUs/servers).

👉 Distributing the training of a single network across multiple devices dramatically reduces training time.

My key points from this book

---------------------------------------------------------------------

👉 This book covers three types of parallelism:

1. Data parallelism

2. Model parallelism

3. Advanced parallelism paradigms, including federated learning

1. Data Parallelism

👉 In today's world, large input datasets such as ImageNet-1K or CIFAR-100 lead to long training times on a single CPU or GPU (node), and the standard practice for speeding up training is data parallelism.

👉 How it works: each node holds a full copy of the model, the input data is partitioned into disjoint subsets, and each node is responsible for training on its own partition.

👉 Synchronization is an important step in data-parallel ML. Different synchronization schemes and strategies behave differently, and the synchronization technique must be chosen according to the model and cluster configuration to achieve optimal parallel training performance.

👉 The parameter server aggregates the gradients and computes the updates to the network parameters using some variant of Stochastic Gradient Descent. The updated parameters are then sent back to each GPU (node) and the process is repeated for a fresh mini-batch.
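
Here is a toy, single-process simulation of that parameter-server loop, purely my own illustration of the pattern (not code from the book): each simulated worker computes gradients on its own mini-batch, the server averages them and applies an SGD step, and the updated weights are pulled back by the workers.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
server_model = nn.Linear(10, 1)          # the parameters live on the "server"
w_true = torch.randn(10, 1)              # synthetic regression target weights
workers, lr = 4, 0.05
loss_fn = nn.MSELoss()

for step in range(100):
    grads, losses = [], []
    for _ in range(workers):
        # Each worker pulls the current server parameters...
        local = nn.Linear(10, 1)
        local.load_state_dict(server_model.state_dict())

        # ...and computes gradients on its own mini-batch shard.
        x = torch.randn(16, 10)
        loss = loss_fn(local(x), x @ w_true)
        loss.backward()
        grads.append([p.grad.clone() for p in local.parameters()])
        losses.append(loss.item())

    # The server averages the gradients and applies an SGD update.
    with torch.no_grad():
        for i, p in enumerate(server_model.parameters()):
            p -= lr * torch.stack([g[i] for g in grads]).mean(dim=0)

    if step % 25 == 0:
        print(f"step {step}: mean worker loss {sum(losses) / workers:.4f}")
```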

👉 The All-Reduce paradigm is a parallel algorithm that aggregates the target arrays produced independently on each node into a single array shared by all processes. The aggregation can be concatenation, summation, or any other operation that allows independent parallel processing of arrays.
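
To make the All-Reduce idea concrete, here is a minimal sketch using PyTorch's torch.distributed with the gloo backend (CPU only), spawning four processes on one machine; this is my own illustration, not the book's code.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Each "node" produces its own local gradient tensor.
    local_grad = torch.full((4,), float(rank))

    # all_reduce sums the tensors from every process in place,
    # so every rank ends up with the same aggregated result.
    dist.all_reduce(local_grad, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {local_grad.tolist()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```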

👉 The book provides an end-to-end implementation of the data-parallel training pipeline for single-GPU, multi-GPU, and multi-machine multi-GPU setups, and also discusses solutions to the shortcomings of current data-parallel pipelines.
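
For context, this is roughly what a minimal data-parallel pipeline looks like in PyTorch with DistributedDataParallel and a DistributedSampler. It is my own condensed sketch of the general pattern (the book's pipeline is far more complete), assuming it is launched with torchrun, e.g. `torchrun --nproc-per-node=2 ddp_sketch.py`.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="gloo")  # use "nccl" on GPU clusters
    rank = dist.get_rank()

    # Toy dataset; DistributedSampler gives each rank a disjoint shard.
    data = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(data)
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    model = DDP(nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2)))
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)           # reshuffle shards each epoch
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()                # DDP all-reduces gradients here
            opt.step()
        if rank == 0:
            print(f"epoch {epoch} loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```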

2. Model Parallelism:

👉 Research advances in NLP and computer vision with deep learning have produced giant model architectures such as Bidirectional Encoder Representations from Transformers (BERT) and the Generative Pre-trained Transformer family (GPT, GPT-2, GPT-3). These models are so large that they cannot fit on a single GPU.

👉 Because language models are huge, data parallelism alone is not enough and model parallelism is often the better approach; these models typically require large computing hardware such as the Nvidia V100, Nvidia P100, Nvidia DGX Station A100 (supercomputing-class hardware), and Nvidia DGX-1.

👉 Model parallelism is a distributed training method in which the deep learning model is partitioned across multiple devices, within or between instances.
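
As a tiny illustration of that idea (my own sketch, assuming a machine with at least two CUDA devices), the two halves of a network can simply be placed on different GPUs and the activations moved between them in forward():

```python
import torch
import torch.nn as nn

class TwoDeviceNet(nn.Module):
    def __init__(self):
        super().__init__()
        # First half of the model lives on GPU 0, second half on GPU 1.
        self.part1 = nn.Sequential(nn.Linear(512, 1024), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(1024, 10)).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))   # move activations to the second device

model = TwoDeviceNet()
out = model(torch.randn(8, 512))
print(out.shape, out.device)                # torch.Size([8, 10]) cuda:1
```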

👉 Vanilla model parallelism runs the forward and backward propagation of each layer on a different GPU node, which is inefficient for training because it leaves GPUs idle most of the time. Two further methods are therefore used:

1. Pipeline parallelism: pipeline parallelism improves both the memory and compute efficiency of deep learning training by partitioning the layers of a model into stages that can be processed in parallel.

2. Intra-layer model parallelism (tensor parallelism): tensor parallelism is a type of model parallelism in which specific model weights, gradients, and optimizer states are split across devices; a toy sketch follows this list.
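
Here is the promised toy tensor-parallelism sketch: the weight matrix of a single linear layer is split across two shards (simulated on CPU), each shard computes its slice of the output features, and the partial outputs are concatenated. This is my own illustration of the idea, not the book's implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
full = nn.Linear(8, 6, bias=False)           # reference layer: weight shape (6, 8)

# Split the output dimension (rows of the weight) into two shards,
# as if each shard lived on a different device.
w0, w1 = full.weight.chunk(2, dim=0)         # each shard: (3, 8)

x = torch.randn(4, 8)
y0 = x @ w0.t()                              # shard 0 computes half the output features
y1 = x @ w1.t()                              # shard 1 computes the other half
y_parallel = torch.cat([y0, y1], dim=1)      # an all-gather would do this across devices

print(torch.allclose(y_parallel, full(x)))   # True: same result as the unsplit layer
```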

👉 You can improve the throughput and latency of a model by applying techniques such as layer freezing, model distillation (the teacher-student approach, which prunes redundant layers in the DNN), and model quantization, which reduces the number of bits the hardware has to work with.
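
As a quick taste of quantization, PyTorch's built-in dynamic quantization stores Linear weights in int8, shrinking the model and speeding up CPU inference; this is only a minimal sketch, not the book's full treatment.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8    # quantize only the Linear layers
)

x = torch.randn(1, 256)
print(quantized(x).shape)                    # same interface, smaller weights
```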

3. Advanced Parallelism Paradigms:

👉 In hybrid data-model parallelism, as used in architectures such as Megatron-LM, model parallelism is applied first to compute the hidden states at each step, and data parallelism is then applied on top of it to compute the final outputs of the model.
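
To give a feel for how the ranks are usually organised in hybrid parallelism, here is my own minimal sketch (not Megatron-LM's code): four CPU processes form a 2x2 grid, each row is a model-parallel group, each column is a data-parallel group, and gradients are only all-reduced within the data-parallel group.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

WORLD_SIZE = 4          # 2 model-parallel ranks x 2 data-parallel replicas
MODEL_PARALLEL = 2

def worker(rank):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=WORLD_SIZE)

    # Every rank must create every group, in the same order.
    model_groups = [dist.new_group([0, 1]), dist.new_group([2, 3])]   # rows: model split
    data_groups = [dist.new_group([0, 2]), dist.new_group([1, 3])]    # columns: replicas

    data_group = data_groups[rank % MODEL_PARALLEL]

    # Gradients are only all-reduced across the data-parallel replicas
    # that hold the *same* shard of the model.
    fake_grad = torch.full((2,), float(rank))
    dist.all_reduce(fake_grad, op=dist.ReduceOp.SUM, group=data_group)
    print(f"rank {rank}: summed gradient within its data-parallel group -> {fake_grad.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, nprocs=WORLD_SIZE)
```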

👉 Federated learning: federated learning is a machine learning method in which models learn from different datasets located on different platforms (e.g., local data centres, central servers) without the training data ever being shared.
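
Here is a minimal federated-averaging (FedAvg) style sketch of that idea, entirely my own toy illustration: each simulated client trains a local copy of the model on its private data, and only the model weights are averaged on the "server".

```python
import copy
import torch
import torch.nn as nn

def local_update(global_model, data, targets, epochs=1, lr=0.1):
    model = copy.deepcopy(global_model)          # the client never shares its raw data
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(data), targets).backward()
        opt.step()
    return model.state_dict()                    # only weights leave the client

def federated_average(states):
    avg = copy.deepcopy(states[0])
    for key in avg:
        avg[key] = torch.stack([s[key] for s in states]).mean(dim=0)
    return avg

global_model = nn.Linear(16, 2)
clients = [(torch.randn(32, 16), torch.randint(0, 2, (32,))) for _ in range(3)]

for round_ in range(5):
    client_states = [local_update(global_model, x, y) for x, y in clients]
    global_model.load_state_dict(federated_average(client_states))
print("finished", round_ + 1, "federated rounds")
```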

👉 Elastic model training transforms static, monolithic training into a dynamic process that is resilient to failures and automatically scales the GPU allocation while training.

👉 Many other advanced methods are discussed in this book, such as kernel event monitoring, job multiplexing, and heterogeneous model training.

I hope you enjoy the knowledge shared in this amazing book.

#data #machinelearning #deeplearning #artificialintelligence #analytics
