The Difference Between Deep Learning Training and Inference
During my time as head of AI products and AI strategy at Intel, I developed a blog series to help explain the basics of AI. After I left Intel in May 2021, my "AI 101" posts got re-attributed to someone named "MaryT_Intel, Community Manager" :) so I decided to re-post them here and to update them periodically as necessary. The post below originally appeared on the Intel AI blog in June 2020.
My last “AI 101” post covered the difference between artificial intelligence, machine learning, and deep learning. In this post, I’ll cover deep learning training and inference -- two key processes associated with developing and using AI.
Training: Creating the deep learning model
In the last post, I explained that deep learning (DL) is a special type of machine learning that involves a deep neural network (DNN) composed of many layers of interconnected artificial neurons. Training is the process of “teaching” a DNN to perform a desired AI task (such as image classification or converting speech into text) by feeding it data, resulting in a trained deep learning model.
During the training process, known data is fed to the DNN, and the DNN makes a prediction about what the data represents. Any error in the prediction is used to update the strength of the connections between the artificial neurons. As the training process continues, the connections are further adjusted until the DNN is making predictions with sufficient accuracy.
As an example, consider training a DNN designed to identify an image as being one of three different categories – a person, a bicycle, or a strawberry (Figure 1).
A data scientist has previously assembled a training data set consisting of thousands of images, with each one labeled as being a person, bicycle, or strawberry. During the training process, as each image is passed to the DNN, the DNN makes a prediction (or “inference”) about what the image represents. In Figure 1, the DNN predicts that one image of a bicycle is a strawberry, which, of course, is incorrect. As this error propagates back through the DNN in a “backward pass”, the weights (i.e., the strength of the interconnections between the artificial neurons) are updated to correct for the error so that the next time the same image is presented to the DNN, the DNN will be more likely to correctly predict that it’s a bicycle.
This training process continues -- with the images being fed to the DNN and the weights being updated to correct for errors, over and over again -- until the DNN is making predictions with the desired accuracy. At this point, the DNN is considered “trained” and the resulting model is ready to be used to make predictions against never-before-seen images.
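To make this loop concrete, here is a minimal sketch in PyTorch (my own illustration, not production code): a toy three-class classifier trained on random stand-in data, showing the forward pass, the error measurement, the backward pass, and the weight update described above.

```python
import torch
import torch.nn as nn

# A toy 3-class classifier (person, bicycle, strawberry). A real model
# would be a deep convolutional network trained on labeled photographs;
# random tensors stand in for images here so the sketch runs on its own.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 128),  # toy 32x32 RGB inputs
    nn.ReLU(),
    nn.Linear(128, 3),            # one output score per class
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(256, 3, 32, 32)  # stand-in "training images"
labels = torch.randint(0, 3, (256,))  # stand-in labels (0, 1, or 2)

for epoch in range(10):
    predictions = model(images)          # forward pass: predict each image
    loss = loss_fn(predictions, labels)  # measure the prediction error
    optimizer.zero_grad()
    loss.backward()                      # backward pass: propagate the error
    optimizer.step()                     # update the weights to reduce it
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```

In practice, training stops when accuracy on held-out data reaches the target, rather than after a fixed number of passes as in this sketch.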
A compute-intensive process often run in a data center
During the DL training process, the data scientist is trying to guide the DNN model to converge and achieve a desired accuracy. This requires running dozens or perhaps hundreds of experiments, trying different DNN designs and adjusting other parameters. Each experiment can require computations on the order of an exaflop (a billion billion operations), which can take hours or days to complete. To speed up what can be a very lengthy process, the data scientist will often train DNNs in a data center, where compute capacity can be scaled out relatively easily by adding more servers.
Inference: Using the deep learning model
Deep learning inference is the process of using a trained DNN model to make predictions against previously unseen data. As explained above, the DL training process actually involves inference, because each time an image is fed into the DNN during training, the DNN attempts to classify it. Given this, deploying a trained DNN for inference can be trivial. You could, for example, simply make a copy of a trained DNN and start using it “as is” for inference. However, for reasons explained below, a trained DNN is often modified and simplified before being deployed for inference.
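As a minimal sketch of that deployment step (assuming a hypothetical file named trained_model.pt that holds the weights of a model like the toy classifier above), inference amounts to loading the trained weights and running a forward pass on new data:

```python
import torch
import torch.nn as nn

classes = ["person", "bicycle", "strawberry"]

# Rebuild the same architecture used in the training sketch, then load
# the trained weights ("trained_model.pt" is a hypothetical file name).
model = nn.Sequential(
    nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU(), nn.Linear(128, 3)
)
model.load_state_dict(torch.load("trained_model.pt"))
model.eval()                               # disable training-only behavior

with torch.no_grad():                      # no backward pass at inference
    new_image = torch.randn(1, 3, 32, 32)  # stand-in for a real photo
    scores = model(new_image)
    predicted = scores.argmax(dim=1).item()

print(f"prediction: {classes[predicted]}")
```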
Optimizing a DNN to meet real-world power and performance requirements
DNN models created for image classification, natural language processing, and other AI jobs can be large and complex, with dozens or hundreds of layers of artificial neurons and millions or billions of weights connecting them. The larger the DNN, the more compute, memory, and energy it consumes, and the longer the response time (or “latency”) between feeding data into the DNN and receiving a result.
But sometimes the use case requires that inference run very fast or at very low power. For example, a self-driving car must be able to detect an obstacle and respond within milliseconds in order to avoid an accident. And a battery-operated drone designed to follow a target or land in your hand has to be power-efficient to maximize flight time. In such cases, there is a desire to simplify the DNN after training in order to reduce power and latency, even if this simplification results in a slight reduction in prediction accuracy.
Even when there are not strict power or latency requirements in play, there may be a strong desire to simplify a DNN simply to save on energy costs. For example, large websites can easily spend millions each year just to supply power to the inference processors that enable them to auto-identify people in uploaded photos or to generate personalized news feeds for each user.
There are several ways to optimize a trained DNN in order to reduce power and latency. In one method, called “pruning”, the data scientist starts by observing the behavior of the artificial neurons across a wide array of inputs. If it turns out that a given group of artificial neurons rarely or never fires, then that group can likely be removed from the DNN without significantly degrading prediction accuracy, thus reducing the model size and improving latency. Another optimization method, called “quantization”, involves reducing the numerical precision of the weights from, say, 32-bit floating point numbers down to 8-bit, which results in a reduced model size and faster computation.
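As an illustration, here is a minimal sketch of both optimizations using PyTorch's built-in utilities on a toy model. Note that this sketch uses weight-magnitude pruning (removing the smallest weights), a close cousin of the activation-based neuron pruning described above, along with dynamic 8-bit quantization:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A toy "trained" model; in practice you would start from real trained weights.
model = nn.Sequential(
    nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU(), nn.Linear(128, 3)
)

# Pruning: zero out the 30% of weights with the smallest magnitude, on the
# theory that they contribute least to the model's predictions.
prune.l1_unstructured(model[1], name="weight", amount=0.3)
prune.remove(model[1], "weight")  # bake the pruned zeros into the weights

# Quantization: convert the remaining 32-bit floating point weights to
# 8-bit integers, shrinking the model and speeding up computation.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```

After either step, the model is re-evaluated on held-out data to confirm that any loss in prediction accuracy is acceptable.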
As a side note, there is an interesting parallel between optimization methods to improve inference performance and data compression techniques used to reduce the size of audio, image, or video files. Just as audio compression can reduce a file size by 80-95% with no perceptible difference in sound quality, inference optimization methods such as pruning and quantization can reduce latency by 75% with minimal reduction in prediction accuracy.
Example: AI-enabled traffic surveillance system
A complete end-to-end AI system covers both training and inference and can involve a range of AI processors of varying specifications. Figure 2 depicts an AI-enabled traffic surveillance system in which AI is used to auto-detect and identify vehicles entering an intersection in violation of traffic rules.
This AI system actually involves four different image recognition models -- one to detect a car entering the intersection, a second to identify an image that contains a license plate, a third to localize the license plate within that image, and a fourth to read the characters on the license plate. All four DNN models are trained and maintained by data scientists using high-power training processors in a data center. Once each model achieves its desired accuracy, the data scientist uses pruning and quantization to reduce the complexity of the model while maintaining sufficient accuracy, after which the models are ready to be deployed for inference.
When the system is operating “live” in production, the “vehicle detection” model running alongside the traffic camera constantly processes a video feed to detect vehicles entering the intersection. If a vehicle enters when the light is red, multiple images of the vehicle are captured and fed into the “license plate detector” model, which finds an image that includes a license plate and transmits it to a gateway server for further processing. At the gateway server, a first inference is run to localize the license plate in the image, and a second inference is run to read the characters on the license plate. Finally, the license plate information is sent to the data center where an application looks up the car’s owner based on the license plate and queues up potential traffic violations to be reviewed and acted upon by a municipal employee.
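Expressed as pseudocode, the cascade might look like the sketch below. Every function is a hypothetical stub standing in for one of the four models or the data-center application; none of these names come from a real system.

```python
# Hypothetical stubs for the four DNN models and the data-center application.
def detect_vehicle(frame):     return True          # model 1: at the camera
def find_plate_image(frames):  return frames[0]     # model 2: at the camera
def locate_plate(image):       return image         # model 3: gateway server
def read_plate(region):        return "ABC-1234"    # model 4: gateway server
def queue_violation(plate):    print("for review:", plate)  # data center

def process_intersection(frames, light_is_red):
    """Run the four-model inference cascade on captured video frames."""
    if not light_is_red:
        return
    if any(detect_vehicle(f) for f in frames):   # vehicle in the intersection?
        plate_image = find_plate_image(frames)   # frame containing a plate
        region = locate_plate(plate_image)       # runs on the gateway server
        queue_violation(read_plate(region))      # plate text to the data center

process_intersection(frames=["frame0", "frame1"], light_is_red=True)
```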
Hopefully through this post you now have a good understanding of the difference between DL training and inference. Future posts will cover more AI basics and additional examples of end-to-end AI business applications.