How Are Big Deep Learning Models Trained? A Book Review
Ashish Patel
6x LinkedIn Top Voice | Sr AWS AI ML Solution Architect at IBM | Generative AI Expert | Author - Hands-on Time Series Analytics with Python | IBM Quantum ML Certified | 12+ Years in AI | MLOps | IIMA | 100k+ Followers
During my #DataScience journey, I have observed that demand for hardware such as GPUs, TPUs, and VPUs keeps rising because of the enormous compute consumed by models like GPT and BERT. It isn't easy to train on a large dataset with ordinary hardware, and even with large computing hardware such as GPUs it isn't always straightforward.
So today I bring you a book that explains how these giant models are trained, using techniques widely practised by the best data scientists in the industry and by companies like Amazon, Microsoft, Google, Nvidia, and OpenAI.
Book: Distributed Machine Learning with Python by Guanhua Wang
Special thanks to Shifa Ansari for providing me with this review copy.
Why should you follow Distributed Machine Learning?
---------------------------------------------------------------------
- It uses a multi-node ML system that improves performance, increases accuracy, and scales to large input data sizes.
- It minimizes machine-made errors and lets individuals make informed decisions and analyses from large amounts of data.
How does this help Data Scientists improve their productivity?
---------------------------------------------------------------------
- It can be used to run multiple experiments in parallel on multiple devices (GPUs/TPUs/servers).
- Distributing the training of a single network across multiple devices dramatically reduces training time.
My key points from this book
---------------------------------------------------------------------
This book covers three types of parallelism:
1. Data parallelism
2. Model parallelism
3. Their evolution with federated learning
1. Data Parallelism
- In today's world, large input datasets such as ImageNet-1K or CIFAR-100 lead to long training times on a single CPU or GPU (node), and the standard practice for speeding up training is data parallelism.
- How it works: during training, each node holds a full copy of the model, while the input data is partitioned into disjoint subsets; each node is responsible for training on its own input partition.
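To make the partitioning concrete, here is a minimal sketch in plain Python (the helper name `shard_dataset` is mine, not from the book) of how a dataset could be split into disjoint per-worker shards:

```python
def shard_dataset(samples, num_workers):
    """Split a dataset into disjoint, near-equal partitions, one per worker."""
    return [samples[rank::num_workers] for rank in range(num_workers)]

data = list(range(10))            # a toy "dataset" of 10 samples
shards = shard_dataset(data, 3)   # one disjoint shard per worker
# Every sample lands in exactly one shard, so the partitions are disjoint
# and together cover the whole dataset.
assert sorted(s for shard in shards for s in shard) == data
```

In a real framework such as PyTorch, a `DistributedSampler` plays this role, but the idea is the same.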
- Synchronization is an important step in data-parallel ML. Different synchronization schemes behave differently, and the right technique must be chosen according to the model and cluster configuration to achieve optimal parallel training performance.
- The parameter server aggregates the gradients and computes the updates to the network parameters using some variant of stochastic gradient descent. The updated parameters are then sent back to each GPU (node), and the process repeats for a fresh mini-batch.
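A rough sketch of that parameter-server round in plain Python (function and variable names are my own illustration, not the book's code):

```python
def parameter_server_step(params, worker_grads, lr=0.1):
    """One round of the parameter-server pattern: average the gradients
    reported by all workers and apply a plain SGD update."""
    num_workers = len(worker_grads)
    for i in range(len(params)):
        avg_grad = sum(g[i] for g in worker_grads) / num_workers  # aggregate
        params[i] -= lr * avg_grad                                # SGD update
    return params  # the updated params are broadcast back to every worker

params = [1.0, 2.0]
grads = [[0.5, 1.0], [1.5, 1.0]]   # gradients reported by two workers
parameter_server_step(params, grads)
# averaged gradients are [1.0, 1.0], so params become [0.9, 1.9]
```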
- The all-reduce paradigm is a parallel algorithm that aggregates the arrays produced independently on each node into a single result array available on every node. The aggregation can be a summation, concatenation, or any other operation that allows independent parallel processing of the arrays.
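A naive sketch of an all-reduce with summation (my own toy code; production systems use bandwidth-optimal schemes such as ring all-reduce, but the end result on every node is the same):

```python
def all_reduce_sum(per_node_arrays):
    """Naive all-reduce: element-wise sum of every node's array, with the
    reduced result made available to every node."""
    reduced = [sum(column) for column in zip(*per_node_arrays)]
    return [list(reduced) for _ in per_node_arrays]

# Three nodes each hold a locally computed gradient array.
out = all_reduce_sum([[1, 2], [3, 4], [5, 6]])
# After all-reduce, every node holds the same summed array [9, 12].
assert out == [[9, 12], [9, 12], [9, 12]]
```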
- An end-to-end implementation of the data-parallel training pipeline is provided in the book for single-GPU, multi-GPU, and multi-machine multi-GPU setups. The book also discusses solutions to shortcomings in current data-parallel pipelines.
2. Model Parallelism
- Research advances in NLP and computer vision with deep learning have produced giant model architectures such as Bidirectional Encoder Representations from Transformers (BERT) and the Generative Pre-trained Transformer family (GPT, GPT-2, GPT-3). These models are so large that they cannot fit on a single GPU.
- Where language models are too huge for data parallelism alone, model parallelism is often a good approach; these models require large computing hardware such as the Nvidia V100, Nvidia P100, Nvidia DGX Station A100, and Nvidia DGX-1.
- Model parallelism is a distributed training method in which the deep learning model is partitioned across multiple devices, within or between instances.
- Vanilla model parallelism places each layer's forward and backward propagation on a different GPU node, which is inefficient in training because GPUs sit idle while waiting for one another. Therefore, two further methods were introduced:
1. Pipeline parallelism: improves both the memory and compute efficiency of deep learning training by partitioning the layers of a model into stages that can be processed in parallel.
2. Intra-layer model parallelism (tensor parallelism): a type of model parallelism in which specific model weights, gradients, and optimizer states are split across devices.
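To illustrate tensor parallelism, here is a minimal plain-Python sketch (my own toy example, not the book's code) of a weight matrix split column-wise across two "devices", each computing a partial matmul whose outputs are concatenated:

```python
def matmul(x, w):
    """Multiply a 1 x k vector by a k x n matrix (given as a list of rows)."""
    return [sum(x[k] * row[j] for k, row in enumerate(w))
            for j in range(len(w[0]))]

def column_parallel_matmul(x, w, num_devices):
    """Tensor (intra-layer) parallelism: split the weight matrix column-wise
    across devices; each device computes its partial product and the partial
    outputs are concatenated."""
    cols_per_device = len(w[0]) // num_devices
    out = []
    for d in range(num_devices):     # each loop iteration stands in for a device
        shard = [row[d * cols_per_device:(d + 1) * cols_per_device] for row in w]
        out.extend(matmul(x, shard)) # this device's partial output
    return out

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
# Splitting the weights across 2 "devices" gives the same result as one device.
assert column_parallel_matmul(x, w, 2) == matmul(x, w)
```

No single device ever holds the full weight matrix, which is the point when the model is too big for one GPU.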
- You can improve the throughput and latency of a model by applying techniques such as layer freezing, model distillation (the teacher-student approach, which prunes redundant layers in a DNN), and model quantization, which reduces the number of bits used to represent the model in hardware.
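As a small illustration of quantization (a generic uniform symmetric int8 scheme of my own, not the book's specific method):

```python
def quantize_int8(weights):
    """Uniform symmetric quantization: map float weights to 8-bit integers."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # guard for all-zero input
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate float weights from the quantized integers."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored weight differs from the original by less than one scale step,
# while storage per weight drops from 32 bits to 8.
assert all(abs(a - b) < scale for a, b in zip(weights, restored))
assert all(-127 <= v <= 127 for v in q)
```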
3. Advanced Parallelism Paradigms
- In hybrid data-model parallelism, architectures such as Megatron-LM first apply model parallelism to compute the hidden states across all steps, and then apply data parallelism to compute the model's final output.
- Federated learning: a machine learning method in which models gain experience from different datasets located on different platforms (e.g., local data centres, central servers) without sharing the training data.
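The core aggregation step of federated learning can be sketched as a FedAvg-style weighted average (a minimal toy version of my own; real systems add secure aggregation, client sampling, and more):

```python
def federated_averaging(client_weights, client_sizes):
    """FedAvg-style aggregation: combine locally trained client models into a
    global model via a weighted average, without sharing any raw training data."""
    total = sum(client_sizes)
    num_params = len(client_weights[0])
    return [sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(num_params)]

# Two clients trained locally on 100 and 300 samples respectively;
# only their model weights (never their data) are sent to the server.
global_model = federated_averaging([[1.0, 0.0], [2.0, 4.0]], [100, 300])
assert global_model == [1.75, 3.0]   # weighted by each client's dataset size
```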
- Elastic model training transforms static monolithic training into a dynamic process that is resilient to failures and automatically scales GPU allocation during training.
- Many other advanced methods are discussed in this book, such as kernel event monitoring, job multiplexing, and heterogeneous model training.
I hope you enjoy this amazing knowledge from this book.
Code: https://github.com/PacktPublishing/Distributed-Machine-Learning-with-Python
Book: https://www.packtpub.com/product/distributed-machine-learning-with-python/9781801815697