Optimizing LLM Training: Memory Management and Multi-GPU Techniques

Hello and welcome, folks! This is my third blog, and it dives deep into the computing needed to train and run Large Language Models (LLMs). It will help you understand the memory requirements of LLMs, how to reduce their size, and some efficient ways to train them. So, let us get started.

In my last blog, Large Language Models Unveiled, we explored how different model architectures work and the tasks each performs best at. Encoder-only models efficiently handle tasks like sentiment analysis, named entity recognition, and word classification. Decoder-only models, which are autoregressive, excel at predicting the next word from the previous context, making them very useful for content creation. Encoder-decoder models, or sequence-to-sequence models, are used for summarization and question answering.

Larger models tend to perform tasks better, following the trend that capability scales with size. Ordered by parameter count, we have:

  • BERT: 340M
  • GPT-2: 1.5B
  • GPT-3: 175B
  • PaLM: 540B

However, creating these large models comes with significant computational challenges, especially during training. One common issue is the dreaded "CUDA out of memory" error. Let's work out how much memory is actually needed to run these models.

Let us understand this with a quick calculation

Approximate GPU RAM needed to store 1 billion parameters:

1 parameter = 4 bytes (32-bit float)

1 billion parameters = 4 x 10^9 bytes = 4 GB

So, 4 GB of GPU RAM at full precision? No, wait, there's more!

Additional components needed to train a model with 1 billion parameters (approximate bytes per parameter):

  • Model weights: 4 bytes
  • Adam optimizer states (two per parameter): +8 bytes
  • Gradients: +4 bytes
  • Activations and temporary memory: roughly +8 bytes

So, the total memory allocation to train a model with 1 billion parameters comes to roughly 24 bytes per parameter, i.e., about 24 GB of GPU memory at 32-bit full precision, which is large and costly!
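To make the arithmetic concrete, here is a tiny Python sketch of this estimate. The per-component byte counts are rule-of-thumb assumptions (an Adam-style optimizer, a rough allowance for activations), not exact measurements:

```python
# Back-of-the-envelope estimate of training memory for a model trained in FP32
# with an Adam-style optimizer. The activation allowance is a rough assumption;
# real activation memory depends on batch size, sequence length, and architecture.
def training_memory_gb(num_params: int) -> float:
    weights = 4 * num_params      # FP32 parameters: 4 bytes each
    optimizer = 8 * num_params    # two FP32 Adam states (momentum + variance) per parameter
    gradients = 4 * num_params    # one FP32 gradient per parameter
    activations = 8 * num_params  # rough allowance for activations and temporary buffers
    return (weights + optimizer + gradients + activations) / 1e9

print(training_memory_gb(1_000_000_000))  # ~24.0 GB
```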


What can we do to reduce the memory needed for training? There are several methods to reduce model size and memory requirements. Let's explore a few:

Quantization:

In this method, we reduce the model's memory footprint by lowering the precision used to store its weights, for example from 32-bit floating-point numbers down to 16-bit floating-point numbers or 8-bit integers.

Let us understand this in more detail. The data types generally used in models are:

  • FP32: 32-bit floating-point numbers | Range: approximately -3.4 x 10^38 to +3.4 x 10^38
  • FP16: 16-bit floating-point numbers | Range: -65504 to 65504
  • INT8: 8-bit integer | Range: -128 to 127

Now, changing the data type can reduce the memory needed, but it also decreases precision, which might reduce model performance. Let me explain with a simple example (there is a small code check right after it):

Value of Pi:

  • FP32: Pi (π) = 3.1415927
  • INT8: Pi (π) = 3
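You can see this precision loss directly in PyTorch by casting the value of pi to lower-precision types (illustrative only; the integer cast simply truncates the fractional part):

```python
import math
import torch

pi32 = torch.tensor(math.pi, dtype=torch.float32)
pi16 = pi32.to(torch.float16)   # half precision keeps only a few decimal digits here
pi8 = pi32.to(torch.int8)       # integer cast truncates the fractional part

print(pi32.item())  # 3.1415927410125732
print(pi16.item())  # 3.140625
print(pi8.item())   # 3
```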

So, to summarize how much memory quantization saves, here is the model size for 1 billion parameters at each precision (a quick way to measure this in PyTorch follows the list):

  • 4 GB @ 32-bit Full Precision
  • 2 GB @ 16-bit Half Precision
  • 1 GB @ 8-bit Precision
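As a quick sanity check, here is a minimal sketch that measures a toy placeholder model's parameter footprint before and after casting it to half precision; the layer sizes are arbitrary and chosen only for illustration:

```python
import torch.nn as nn

# Toy placeholder model; the layer sizes are arbitrary and for illustration only.
model = nn.Sequential(nn.Linear(1024, 4096), nn.Linear(4096, 1024))

def param_size_gb(m: nn.Module) -> float:
    # Sum of (number of elements) x (bytes per element) over all parameters.
    return sum(p.numel() * p.element_size() for p in m.parameters()) / 1e9

print(f"FP32: {param_size_gb(model):.4f} GB")
print(f"FP16: {param_size_gb(model.half()):.4f} GB")  # .half() casts parameters to 16-bit floats
```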

More recently, another data type, BFLOAT16 (brain floating point), developed by Google Brain, has become popular in deep learning. Many LLMs, like FLAN-T5, have been trained with BFLOAT16, and it is supported by newer GPUs such as the NVIDIA A100. BFLOAT16 is a 16-bit floating-point format for machine learning composed of one sign bit, eight exponent bits, and seven mantissa bits. Its advantage is that, because it keeps FP32's eight exponent bits, its dynamic range is essentially the same as FP32's while taking up only half the memory.
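PyTorch exposes these numeric properties through torch.finfo, which makes the range comparison easy to verify:

```python
import torch

# BFLOAT16 keeps FP32's eight exponent bits, so its maximum value is close to FP32's,
# while FP16's larger mantissa comes at the cost of a much smaller range.
for dtype in (torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} max={info.max:.3e}  bytes per value={info.bits // 8}")
```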

We have discussed how we can reduce the size of the models using quantization. Now, let us look at how to train such large models efficiently.

Efficient Multi-GPU Training Strategies

To efficiently scale across GPUs, we can use the following techniques:

Distributed Data Parallel (DDP)

This is a popular and efficient technique for training across multiple GPUs, and it ships with PyTorch. Here's how it works (a minimal code sketch follows the list):

  • DDP copies the model to each GPU and sends batches of data to each GPU in parallel.
  • Each GPU processes its batch in parallel; a synchronization step then combines (all-reduces) the gradients from every GPU, so the model copy on each GPU receives identical updates.
  • This process allows for parallel computation across GPUs, leading to faster training.
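Here is a minimal sketch of the DDP pattern, assuming a single machine with multiple GPUs and a launch via torchrun; the model, data, and hyperparameters are placeholders, not a real training recipe:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK for each spawned process (one process per GPU).
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # Placeholder model; in practice this would be your LLM.
    model = nn.Linear(1024, 1024).to(f"cuda:{local_rank}")
    ddp_model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()

    for _ in range(10):  # toy training loop on random data
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        y = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = loss_fn(ddp_model(x), y)
        optimizer.zero_grad()
        loss.backward()   # DDP all-reduces gradients across GPUs during backward
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=NUM_GPUS this_script.py
```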

Fully Sharded Data Parallel (Model Sharding)

A popular implementation of model sharding is PyTorch's Fully Sharded Data Parallel (FSDP). FSDP is inspired by a technique proposed in a Microsoft paper called ZeRO, which stands for Zero Redundancy Optimizer.

The goal of ZeRO is to distribute the model states across GPUs with zero data overlap. This allows the model to scale across GPUs even when it does not fit on a single chip.

ZeRO is more memory-efficient than DDP because DDP replicates the full model states (parameters, gradients, and optimizer states) on every GPU, which is redundant and consumes extra space. In contrast, ZeRO shards these model states across GPUs, eliminating that redundancy.
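And here is a comparable minimal FSDP sketch (again a placeholder model launched via torchrun; real models typically also configure an auto-wrap policy and mixed precision, which are omitted here):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # Placeholder model; FSDP shards its parameters, gradients, and optimizer
    # states across the participating GPUs instead of replicating them.
    model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))
    fsdp_model = FSDP(model.to(f"cuda:{local_rank}"))

    # The optimizer must be created after wrapping, so it sees the sharded parameters.
    optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)

    x = torch.randn(8, 4096, device=f"cuda:{local_rank}")
    loss = fsdp_model(x).sum()   # toy forward pass and loss
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=NUM_GPUS this_script.py
```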


I hope this blog gave you a good insight into the computation, storage, and training of Large Language Models, as well as efficient multi-GPU handling. If you would like to understand more about how large language models work, do visit my previous blog, Large Language Models Unveiled.

In the next blog in this series, we will dive deeper into fine-tuning.

Thank you for following my blog, and stay tuned for more informative content. If you would like to engage in a discussion on the use of Generative AI in industrial applications like Manufacturing and Finance, feel free to connect with me. I have developed some great solutions at InsightAI that provide deep insights from your data. Connect with me at [email protected] .


#LargeLanguageModels #ArtificialIntelligence #MachineLearning #TextGeneration #AIInsights #GenerativeAI #TechInnovation #DataScience #NLP #AITechnology
