How to Select the Right GPU Instance for Your Team on AWS?
Image source: DALL·E


Imagination is the exaggeration of the data you have in your brain. On a lovely evening in Amsterdam, I want to train a diffusion model to compose a piece of music, but I need an AWS GPU instance to do it. We machine learning engineers are always baffled about which GPU instance on AWS is optimal. I did a short study on this, and this article is the outcome.

Hopefully, after reading it, you will be able to select the right GPU instance for your work.

The Decision Tree for Choosing a GPU Instance

[Image: GPU instance decision tree]

DrawIo Link

If you are doing HPC (high-performance computing) work, such as drug discovery or other high-precision jobs, we suggest the P (historically called "Performance-heavy") instance family. Otherwise, we recommend the G instance family. The cost charts for the noted GPU instances are below.
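First, a minimal Python sketch of the decision above. The function name and workload labels are purely illustrative, not anything provided by AWS:

# A sketch of the decision tree: HPC / high-precision work -> P family,
# everything else -> G family. Labels are made up for illustration.
def suggest_instance_family(workload: str) -> str:
    hpc_workloads = {"hpc", "drug-discovery", "high-precision"}
    if workload.lower() in hpc_workloads:
        return "P family (e.g. p3, p4d)"
    return "G family (e.g. g4dn, g5)"

print(suggest_instance_family("drug-discovery"))   # P family (e.g. p3, p4d)
print(suggest_instance_family("nlp-fine-tuning"))  # G family (e.g. g4dn, g5)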

P3 and P4 Instance Costs

[Image: P3 and P4 instance cost chart]

G4 and G5 Instance Costs

[Image: G4 and G5 instance cost chart]

(The DALL·E picture used in this article is not related to the GPU instances; it is not an actual view, just a concept generated with a diffusion model.)


Don't Always Select a GPU on Price Alone

Please don't always select a GPU purely on its hourly price. We ran a small experiment: we trained a SciBERT transformer model on 100K data points.

The result on a g4dn.2xlarge machine:

[Image: training run output on g4dn.2xlarge]
The cost is: ($0.752/hour × 311.25 s) / 3600 ≈ $0.065


We also ran the same configuration on a g5.xlarge machine. The result:

[Image: training run output on g5.xlarge]


The cost is: ($1.006/hour × 197.31 s) / 3600 ≈ $0.055


So if we use a g5.xlarge machine, we can save roughly 15% of the cost, and the run also finishes about 1.6x faster.
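A minimal sketch of the cost arithmetic above, so it is easy to rerun with other prices or training times. The hourly prices are assumed to be the on-demand us-east-1 rates; the training times in seconds come from the two runs shown:

# Cost = hourly price * training time in seconds / 3600 seconds per hour.
runs = {
    "g4dn.2xlarge": {"price_per_hour": 0.752, "train_seconds": 311.25},
    "g5.xlarge":    {"price_per_hour": 1.006, "train_seconds": 197.31},
}

costs = {
    name: r["price_per_hour"] * r["train_seconds"] / 3600
    for name, r in runs.items()
}

for name, cost in costs.items():
    print(f"{name}: ${cost:.3f}")                   # ~$0.065 vs ~$0.055

saving = 1 - costs["g5.xlarge"] / costs["g4dn.2xlarge"]
print(f"Saving with g5.xlarge: {saving:.0%}")       # roughly 15%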

Other benefits of the g5 family over the g4 family are:

  • The g5 instances use the NVIDIA Ampere architecture (A10G GPUs), a modern architecture that supports all the common precision formats, whereas g4dn uses the older Turing-based T4.
  • We should use mixed-precision training in PyTorch-based projects. There are several floating-point datatypes to choose from: FP32, FP16, TF32, and BF16.

Source: NVIDIA Blog

We did the apples-to-apples comparison with the fp16 datatype, because tf32 and bf16 need the Ampere architecture, which is only available in the g5 instance family.
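For reference, this is roughly what fp16 mixed-precision training looks like in PyTorch with torch.cuda.amp. The tiny linear model and random batches are only stand-ins for the actual SciBERT setup:

import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

# Toy stand-ins for the real model and dataloader (illustration only).
model = nn.Linear(768, 2).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()
scaler = GradScaler()                     # scales the loss to avoid fp16 underflow

for step in range(10):
    x = torch.randn(32, 768, device="cuda")
    y = torch.randint(0, 2, (32,), device="cuda")

    optimizer.zero_grad()
    with autocast():                      # ops run in fp16 where it is safe
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()         # backward pass on the scaled loss
    scaler.step(optimizer)                # unscales gradients, then optimizer.step()
    scaler.update()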

PS: What are TF32 and BF16?

BF16

If you have access to Ampere or newer hardware, you can use bf16 for your training and evaluation. While bf16 has worse precision than fp16, it has a much bigger dynamic range. Therefore, if in the past you were experiencing overflow issues while training the model, bf16 will prevent this from happening most of the time. Remember that in fp16 the biggest number you can have is `65504`, and any number above that will overflow. A bf16 number can be as large as `3.39e+38` (!), which is about the same as fp32, because both use 8 bits for the numerical range.
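A minimal sketch of using bf16 in PyTorch on Ampere hardware; the toy linear layer is just for illustration:

import torch
from torch import nn

# bf16 requires Ampere or newer (e.g. the A10G in g5 instances).
print(torch.cuda.is_bf16_supported())

model = nn.Linear(768, 2).cuda()
x = torch.randn(32, 768, device="cuda")

# bf16 keeps fp32's dynamic range, so the usual fp16 loss scaling is not needed.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)

print(out.dtype)   # torch.bfloat16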

[Image: bf16 illustration]


TF32

The Ampere hardware uses a magical data type called tf32. It has the same numerical range as fp32 (8 exponent bits), but instead of fp32's 23 bits of precision it has only 10 (the same as fp16), so it uses only 19 bits in total.

It’s magical in the sense that you can use the normal fp32 training and/or inference code and by enabling tf32 support you can get up to 3x throughput improvement.
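In PyTorch, enabling tf32 comes down to two backend flags; a minimal sketch (the toy matmul is just for illustration):

import torch

# Enable tf32 for matmuls and cuDNN convolutions. These flags only take
# effect on Ampere or newer GPUs; elsewhere they are simply ignored.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# The training / inference code itself stays plain fp32.
a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
c = a @ b   # executed with tf32 tensor cores where possible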

When this is done, CUDA will automatically switch to using tf32 instead of fp32 where possible. This, of course, assumes that the GPU in use is from the Ampere series.

Like all cases of reduced precision, this may or may not be satisfactory for your needs, so you have to experiment and see. According to NVIDIA research, the majority of machine learning training workloads shouldn't be impacted and show the same perplexity and convergence as fp32 training.

[Image: tf32 illustration]

For further reading, here are the detailed AWS GPU instance specifications:

[Images: AWS GPU instance specification tables]

