A Primer on Nvidia-Docker — Where Containers Meet GPUs
GPUs are critical for training deep learning models and neural networks. While they may not be needed for simple models based on linear regression or logistic regression, complex models built around convolutional neural networks (CNNs) and recurrent neural networks (RNNs) rely heavily on GPUs. Computer vision models based on frameworks such as Caffe2 and TensorFlow, in particular, depend on the GPU.
In supervised machine learning, a set of features and labels is used to train a model. Deep learning algorithms don’t even need explicitly engineered features to produce trained models. They essentially “learn” from existing datasets designated for training, testing, and evaluation.
Neural networks perform complex computations on tens of thousands of matrices before the final model emerges. When an image is fed to a CNN, it gets translated into a matrix of real numbers. Depending on the color depth and size of the image, the neural network generates multiple such matrices. These matrices are added to and multiplied with other matrices during forward propagation and backward propagation until appropriate weights are derived.
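To make that matrix arithmetic concrete, here is a minimal sketch of one forward-propagation step through a single dense layer in NumPy; the image contents, layer width, and weight values are illustrative stand-ins, not parameters from a real model:

```python
import numpy as np

# A 28x28 grayscale image becomes a matrix of real numbers;
# flatten it into a 784-element input vector.
image = np.random.rand(28, 28)        # stand-in for real pixel data
x = image.reshape(-1)                 # shape: (784,)

# One dense layer of a toy network: the weights and biases here
# are random stand-ins, not a trained model.
W = np.random.randn(128, 784) * 0.01  # weight matrix
b = np.zeros(128)                     # bias vector

# Forward propagation: a matrix-vector product plus a nonlinearity.
hidden = np.maximum(0, W @ x + b)     # ReLU activation, shape: (128,)

# During training, backward propagation computes gradients of a loss
# with respect to W and b, and the weights are updated repeatedly
# until they converge. GPUs accelerate exactly these matrix operations.
```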
A trained model can run on CPUs for inference. Since inference is not as computationally intensive as training, GPUs are strictly optional when running models for inference.
CPUs are not designed to deal with such a rapid rate of computation. While an individual CPU core is faster at sequential number crunching, CPUs are not built to parallelize mathematical operations. That’s where GPUs play a crucial role: each GPU core may not have the horsepower of a CPU core, but thousands of them working together can perform massively parallel calculations.
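As a rough illustration of that difference, here is a minimal sketch that runs the same matrix multiplication on the CPU with NumPy and on the GPU with the CuPy library; it assumes a CUDA-capable GPU with CuPy installed, and the matrix size is an arbitrary choice:

```python
import time
import numpy as np
import cupy as cp  # assumes a CUDA-capable GPU and CuPy installed

n = 4096
a_cpu = np.random.rand(n, n).astype(np.float32)
b_cpu = np.random.rand(n, n).astype(np.float32)

# CPU: multiply the matrices on the host.
start = time.time()
np.matmul(a_cpu, b_cpu)
print(f"CPU: {time.time() - start:.3f}s")

# GPU: copy the matrices to device memory and multiply them there.
a_gpu = cp.asarray(a_cpu)
b_gpu = cp.asarray(b_cpu)
start = time.time()
cp.matmul(a_gpu, b_gpu)
cp.cuda.Device(0).synchronize()  # wait for the asynchronous GPU kernel to finish
print(f"GPU: {time.time() - start:.3f}s")
```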
Read the entire article at The New Stack
Janakiram MSV is an analyst, advisor, and architect. Follow him on Twitter, Facebook and LinkedIn.