Optimizing LLM Training: Memory Management and Multi-GPU Techniques
Akash Chandra
Founder & CEO, InsightAI | Fintech | Machine Learning | DevOps | Secure Cloud Architect
Hello and welcome, folks! Here is my third blog, which dives deep into the compute needed to run Large Language Models (LLMs). This blog will help you understand the memory requirements of LLMs, how to reduce their size, and some efficient ways to train them. So, let us get started.
In my last blog, Large Language Models Unveiled, we explored how different model architectures work and the tasks they perform best. Encoder-only models efficiently handle tasks like sentiment analysis, named entity recognition, and word classification. Decoder-only models, which are autoregressive, excel at predicting the next word from the previous context, making them very useful for content creation. Encoder-decoder models, or sequence-to-sequence models, are used for summarization and question answering.
Larger models tend to perform tasks better, following the general trend that capability grows with size. Based on the number of parameters, we have:
However, creating these large models comes with significant computational challenges, especially during training. One common issue is the dreaded "CUDA out of memory" error. Let's work out how much memory is actually needed to run these models.
Let us understand this with a calculation
Approximate GPU RAM needed to store 1 billion parameters:
1 parameter = 4 bytes (32-bit float)
1 billion parameters = 4 x 10^9 bytes = 4 GB
So, 4 GB of GPU RAM for full precision? No, wait, there's more!
Additional components needed to train 1 billion parameters:
So, the total memory needed to train a model with 1 billion parameters comes to about 24 GB of GPU memory at 32-bit full precision, roughly six times the 4 GB required just to store the weights, which is large and costly!
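To make this concrete, here is a small Python sketch of that arithmetic. The per-parameter breakdown below (weights, Adam optimizer states, gradients, activations and temporary buffers) is an assumed rule of thumb; the exact numbers depend on the optimizer and the framework.

```python
# Rough GPU memory estimate for full-precision (FP32) training.
# Assumed per-parameter breakdown (varies with optimizer and implementation):
#   weights: 4 bytes, Adam optimizer states: 8 bytes,
#   gradients: 4 bytes, activations and temporary buffers: ~8 bytes.
BYTES_PER_PARAM = {
    "weights": 4,
    "adam_optimizer_states": 8,
    "gradients": 4,
    "activations_and_temp": 8,
}

def training_memory_gb(num_params: float) -> float:
    """Return an approximate GPU memory requirement in GB for FP32 training."""
    total_bytes = num_params * sum(BYTES_PER_PARAM.values())
    return total_bytes / 1e9

if __name__ == "__main__":
    for n in (1e9, 7e9, 70e9):
        print(f"{n / 1e9:>5.0f}B params -> ~{training_memory_gb(n):,.0f} GB")
```

Running it prints roughly 24 GB for a 1-billion-parameter model, and the requirement grows linearly from there.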
What can we do to reduce the memory needed for training? There are several methods to reduce model size and memory requirements. Let's explore a few:
Quantization:
In this method, we reduce the model size by lowering the storage precision from 32-bit floating-point numbers to 16-bit floating-point numbers.
Let us understand this in more detail. The data types generally used in models are:
Now, changing the data type can reduce the memory needed, but it also decreases the precision, which might hurt model performance. Let me explain with a simple example:
Value of Pi:
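As a quick, hands-on illustration (a minimal PyTorch sketch), you can store pi at different precisions and see both the rounding error and the bytes used per value:

```python
import math
import torch

# Store pi at different precisions and compare with the full-precision reference.
pi = math.pi
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    approx = torch.tensor(pi, dtype=dtype)
    # .item() converts the stored value back to a Python float for printing.
    print(f"{str(dtype):<16} value={approx.item():.10f} "
          f"error={abs(approx.item() - pi):.2e} "
          f"bytes={approx.element_size()}")
```

The 16-bit formats need half the storage of FP32, at the cost of a coarser approximation of the value.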
So, to understand how much memory is saved, let us summarize Quantization:
So, how much does the model size reduce for 1 billion parameters:
Another data type, BFLOAT16, developed by Google Brain, has recently become popular in deep learning. Many LLMs, such as FLAN-T5, have been trained with BFLOAT16, and it is supported by newer GPUs like the NVIDIA A100. BFLOAT16 is a 16-bit floating-point format designed for machine learning, composed of one sign bit, eight exponent bits, and seven mantissa bits. Its advantage is that its dynamic range matches that of FP32 while taking up only half the memory.
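Here is a short sketch of the storage saving in practice: cast a dummy weight tensor from FP32 to BFLOAT16 and extrapolate the per-element cost to 1 billion parameters.

```python
import torch

# A small dummy weight tensor; 10 million elements is enough to measure
# bytes per element without allocating gigabytes of memory.
weights_fp32 = torch.randn(10_000_000, dtype=torch.float32)
weights_bf16 = weights_fp32.to(torch.bfloat16)

def gb_for_one_billion(t: torch.Tensor) -> float:
    """Extrapolate the per-element storage cost to 1 billion parameters."""
    return t.element_size() * 1_000_000_000 / 1e9

print(f"FP32:     {gb_for_one_billion(weights_fp32):.0f} GB per 1B parameters")
print(f"BFLOAT16: {gb_for_one_billion(weights_bf16):.0f} GB per 1B parameters")
```

This prints 4 GB for FP32 versus 2 GB for BFLOAT16, the halving described above.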
We have discussed how we can reduce the size of the models using quantization. Now, let us look at how to train such large models efficiently.
Efficient Multi-GPU Training Strategies
To efficiently scale across GPUs, we can use the following techniques:
Distributed Data Parallel (DDP)
This is a popular and efficient technique for training across multiple GPUs, and it ships as part of PyTorch. Here's how it works, with a minimal sketch shown below:
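The following is a minimal sketch of a typical DDP training script. The model, dataset, and hyperparameters are placeholders, and the script assumes it is launched with torchrun so that the rank environment variables are set for each process.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # torchrun sets LOCAL_RANK, RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and data; replace with your own.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(4096, 1024), torch.randn(4096, 1024))
    # DistributedSampler gives each GPU a different shard of the data.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle data differently each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()   # DDP averages gradients across GPUs here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, for example, `torchrun --nproc_per_node=4 train_ddp.py`, each GPU holds a full replica of the model and processes its own shard of the data, while DDP averages the gradients across GPUs during the backward pass.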
Fully Sharded Data Parallel (Model Sharding)
A popular implementation of data sharding is PyTorch’s ‘Fully Sharded Data Parallel’ (FSDP). The FSDP technique is inspired by a method proposed in a paper by Microsoft called ZeRO, which stands for Zero Redundancy Optimizer.
The goal of ZeRO is to distribute the model states across GPUs with zero data overlap. This allows the model to scale across GPUs even when it does not fit on a single chip.
ZeRO is more efficient than DDP because, in DDP, the full model must be replicated on every GPU, which is redundant and consumes more memory. In contrast, ZeRO shards the model states (parameters, gradients, and optimizer states) across GPUs, eliminating that redundancy.
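To show the contrast, here is a minimal sketch that swaps the DDP wrapper above for PyTorch's FSDP. The model is again a placeholder and the default sharding settings are used; in practice you would pass an auto-wrap policy suited to your transformer layers (see the PyTorch FSDP documentation).

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; in practice this is the large model that does not
    # fit on a single GPU when fully replicated.
    model = torch.nn.Sequential(
        torch.nn.Linear(4096, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 4096),
    ).cuda(local_rank)

    # FSDP shards parameters, gradients, and optimizer states across GPUs,
    # gathering each shard only when it is needed for compute.
    model = FSDP(model, device_id=local_rank)

    # Create the optimizer after wrapping, so it sees the sharded parameters.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    x = torch.randn(8, 4096, device=local_rank)

    optimizer.zero_grad()
    loss = model(x).sum()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Because the model states are sharded rather than replicated, each GPU holds only a fraction of them, which is what lets models that do not fit on a single chip still be trained.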
I hope this blog gave you a good insight into the computation, storage, and training of Large Language Models, as well as efficient GPU handling. If you would like to understand more about how large language models work, do visit my blog:
In the next blog in this series, we will dive deeper into fine-tuning.
Thank you for following my blog, and stay tuned for more informative content. If you would like to engage in a discussion on the use of Generative AI in industrial applications like Manufacturing and Finance, feel free to connect with me. I have developed some great solutions at InsightAI that provide deep insights from your data. Connect with me at [email protected] .
#LargeLanguageModels #ArtificialIntelligence #MachineLearning #TextGeneration #AIInsights #GenerativeAI #TechInnovation #DataScience #NLP (Natural Language Processing) #AITechnology