NVIDIA Mixed Precision & Power Consumption - Part 1
Andrew Antonopoulos
Senior Solutions Architect at Sony Professional Solutions Europe
Deep learning has enabled progress in many different applications and can be used to develop models for classification and regression tasks.
Larger models usually require more computing and memory resources to train, and modern deep-learning training systems use a single-precision (FP32) format.
The IEEE Standard for Floating-Point Arithmetic is the common convention for representing numbers in binary on computers. In double-precision format, each number takes up 64 bits. Single-precision format uses 32 bits, while half-precision is just 16 bits.
In single-precision, 32-bit format, one bit is used to tell whether the number is positive or negative. Eight bits are reserved for the exponent, which (because it’s binary) is 2 raised to some power. The remaining 23 bits are used to represent the digits that make up the number, called the significand.
Double precision instead reserves 11 bits for the exponent and 52 bits for the significand, dramatically expanding the range and size of numbers it can represent. Half precision takes an even smaller slice of the pie, with just five bits for the exponent and 10 for the significand.
The following image visualises the above information:
and if we want to represent π using these precision levels, it will look like this:
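As a quick sketch (using NumPy, which is not otherwise part of the code in this article), π can be stored at each precision level and printed with ten decimal places to show how the representable digits shrink:
import numpy as np
# Illustrative only: the same value of pi stored at each IEEE precision level
print('%.10f' % np.float64(np.pi))   # 3.1415926536 (double precision, 64-bit)
print('%.10f' % np.float32(np.pi))   # 3.1415927410 (single precision, 32-bit)
print('%.10f' % np.float16(np.pi))   # 3.1406250000 (half precision, 16-bit)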
In the following paper, Nvidia introduces a methodology for training deep neural networks using half-precision floating point numbers without losing model accuracy or modifying hyper-parameters (which is a time-consuming process). The paper suggests that this method nearly halves memory requirements and speeds up arithmetic on recent GPUs. Weights, activations, and gradients are stored in IEEE half-precision format.
Nvidia paper: https://arxiv.org/pdf/1710.03740
According to Nvidia, the performance (speed) of any program, including neural network training and inference, is limited by one of three factors: arithmetic bandwidth, memory bandwidth, or latency.
Reduced precision addresses two of these limiters. Memory bandwidth pressure is lowered by using fewer bits to store the same number of values. Arithmetic time can also be lowered on processors that offer higher throughput for reduced precision math. For example, half-precision math throughput in recent GPUs is 2× to 8× higher than for single-precision. In addition to speed improvements, reduced precision formats also reduce the amount of memory required for training.
Mixed precision uses 16-bit and 32-bit floating-point types in a model during training to make it run faster and use less memory. By keeping certain parts of the model in 32-bit types for numeric stability, the model achieves a lower step time while training equally well in terms of evaluation metrics such as accuracy.
Most models use the float32 dtype, which takes 32 bits of memory. However, there are two lower-precision dtypes, float16 and bfloat16, each taking 16 bits of memory instead. Modern accelerators can run operations faster in the 16-bit dtypes, as they have specialised hardware to run 16-bit computations, and 16-bit dtypes can be read from memory faster.
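A quick way to see the memory saving is to compare the size of the same tensor in both dtypes; the shape below is an arbitrary example:
import numpy as np
# Arbitrary example: roughly one million values in each tensor
activations_fp32 = np.zeros((1024, 1024), dtype=np.float32)
activations_fp16 = np.zeros((1024, 1024), dtype=np.float16)
print(activations_fp32.nbytes // 1024**2, 'MiB')   # 4 MiB
print(activations_fp16.nbytes // 1024**2, 'MiB')   # 2 MiB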
Nvidia GPUs can run operations in float16 faster than in float32, and TPUs can run operations in bfloat16 faster than in float32. However, variables and a few computations should still be in float32 for numeric reasons so that the model trains to the same quality.
Implementation
To implement mixed precision, you will need to use the Keras mixed precision API, which allows you to use a mix of either float16 or bfloat16 with float32 to get the performance benefits from float16/bfloat16 and the numeric stability benefits from float32.
Initially, you will need to import the appropriate libraries:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import mixed_precision
Mixed precision will only speed up models on recent NVIDIA GPUs, Cloud TPUs, and Intel CPUs. NVIDIA GPUs use a mix of float16 and float32, while TPUs and Intel CPUs support a mix of bfloat16 and float32.
Among NVIDIA GPUs, those with compute capability 7.0 or higher will see the greatest performance benefit from mixed precision because they have special hardware units called Tensor Cores to accelerate float16 matrix multiplications and convolutions. Older GPUs offer no math performance benefit for using mixed precision. However, memory and bandwidth savings can enable some speedups.
You can check your GPU type with the following command, which is available if the NVIDIA drivers are installed.
nvidia-smi -L
and the output will be similar to this:
GPU 0: NVIDIA GeForce RTX 4060 Ti (UUID: <UUID number>)
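Alternatively, a minimal sketch using TensorFlow itself (assuming the imports above) can report the detected GPU and its compute capability:
# Query the GPU and its compute capability through TensorFlow
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    details = tf.config.experimental.get_device_details(gpus[0])
    print(details.get('device_name'))         # e.g. NVIDIA GeForce RTX 4060 Ti
    print(details.get('compute_capability'))  # e.g. (8, 9)
else:
    print('No GPU detected')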
To use mixed precision in Keras, you will need to create a tf.keras.mixed_precision.Policy, typically referred to as a dtype policy.
Dtype policies specify the dtypes in which layers will run. You will need to construct a policy from the string 'mixed_float16' and set it as the global policy. This will cause subsequently created layers to use mixed precision with a mix of float16 and float32.
# Policy for mixed precision
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_global_policy(policy)
The policy specifies two important aspects of a layer: the dtype the layer's computations are done in and the dtype of a layer's variables. With this policy, layers use float16 computations and float32 variables. Computations are done in float16 for performance, but variables must be kept in float32 for numeric stability.
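Continuing from the imports and global policy above, a minimal check (the Dense layer here is only an illustrative example) shows both dtypes on a freshly created layer:
# With 'mixed_float16' active, new layers compute in float16 but keep float32 weights
layer = layers.Dense(units=10)
layer.build(input_shape=(None, 20))
print(layer.compute_dtype)    # float16 - used for the forward pass
print(layer.variable_dtype)   # float32 - used for the weights
print(layer.kernel.dtype)     # float32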
Validation
To validate the mixed precision setup, you will need to print out the dtype policy by using this code:
# Print out the dtype policy for compute and variables
print('Compute dtype: %s' % policy.compute_dtype)
print('Variable dtype: %s' % policy.variable_dtype)
and the output will be similar to this:
INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9
Compute dtype: float16
Variable dtype: float32
As you can see from the above output, the GPU has a compute capability of 8.9, and the policy has been set up successfully.
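Putting the pieces together, the following is a minimal sketch of a model built under the mixed_float16 policy; the architecture and dataset references are illustrative assumptions, not the configuration used in the experiments below. The only precision-specific detail is that the final softmax layer is kept in float32 for numeric stability:
# Illustrative model only - not the configuration used in the experiments below
inputs = keras.Input(shape=(224, 224, 3))
x = layers.Conv2D(32, 3, activation='relu')(inputs)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dense(128, activation='relu')(x)
# Keep the output layer in float32 so the softmax remains numerically stable
outputs = layers.Dense(525, activation='softmax', dtype='float32')(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(train_ds, validation_data=val_ds, epochs=10)  # hypothetical datasets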
Testing & Power Consumption
After reading the Nvidia paper, the question was raised: Will this provide any benefit during the ML training, and will it reduce the hardware's carbon footprint?
Calculating the carbon footprint will require 4 steps:
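In essence, the estimate multiplies the measured energy by a grid carbon-intensity factor. The sketch below is illustrative only; the power draw, runtime, and carbon-intensity values are placeholders, not measurements from these tests:
# Placeholder values only - not measurements from the experiments
avg_gpu_power_w = 120        # average GPU power draw in watts (e.g. from nvidia-smi)
training_hours = 3.0         # total training time in hours
carbon_intensity = 0.233     # kg CO2e per kWh (depends on the local grid)
energy_kwh = avg_gpu_power_w * training_hours / 1000
co2e_kg = energy_kwh * carbon_intensity
print('Energy: %.2f kWh, estimated footprint: %.3f kg CO2e' % (energy_kwh, co2e_kg))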
The dataset used during the tests contained images of 525 bird species, with 84,635 training images, 2,625 test images (5 images per species), and 2,625 validation images (5 images per species).
Four tests were completed using different hyper-parameters and mixed precision as the floating-point format. The ML model configuration for each test was the following:
Benchmarking
1st Experiment
2nd Experiment
3rd Experiment
The GPU power consumption, utilisation and overall power consumption across all the tests can be seen in the following image:
and a graphical presentation of the above tests can be seen in the following graph:
Overall, the test results confirmed that using mixed precision for a classification model will reduce power consumption but requires adjusting the hyper-parameters. The 3rd experiment used more neurons, which forced the GPU to work harder, yet power consumption stayed at a low level, close to that of the 2nd experiment, which used fewer neurons.
Mixed precision is a great option for training models, especially when using Nvidia GPUs.
Check Part 2 for more information about loss and accuracy