How I Improved the Performance of My Computer Vision Model Twofold
Python is great for training deep learning models. The variety of supported platforms makes it easy for pretty much anyone to train their own custom neural network.
But what about inference? When it comes time to deploy the model, whether in a web app with the deep learning model running in the backend (e.g. the Cough Symptom Analysis Web App, https://cospect.konect-co.com/) or on an edge device, it's not only the accuracy of the model that matters but also its speed, especially when the hardware is a limitation.
From a quick glance at the TensorFlow Model Garden's detection model zoo below, it's clear that accuracy isn't the limiting factor when it comes to object detection. (Source: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md)
But in a time-sensitive or resource-constrained environment, such as the real-time models that let self-driving cars "see" the world, you certainly don't want a latency of 5 seconds; quick reaction time is necessary on the road.
So, let's dig deep into the problem. Here's where most deep learning tutorials stop:
Ok, great. You can see your object detection model in action, and the fact that you are now an AI developer brings a smile to your face. But how long does it take to run on your machine in Python? Let's have a look.
Baseline: Running a training graph
On an "Intel(R) Xeon(R) CPU @ 2.30GHz" on Google Colab, the model takes 81 ms to execute! This model is certainly not fit to be used for inferencing in time-sensitive environments. And what about the reported 30ms latency you saw advertised? Wherefore this drastic reduction in speed?
The main explanation is that the current graph is a training graph; it isn't optimized for execution. First, its weights are still treated as "variables" by TensorFlow rather than constants, which slows down operations. Second, the training graph contains redundant operations, such as "Identity" nodes, that serve no purpose during inference.
What is the solution? How can we get our model to a point where it's suitable for inferencing and can be deployed?
Solution 1: Freezing the Graph
One quick fix is to freeze the graph, that is, to convert all variables to constant tensors. The function to accomplish this is convert_variables_to_constants_v2, found in tensorflow.python.framework.convert_to_constants.
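As a rough sketch (assuming the same SavedModel and "serving_default" signature as above, with a placeholder path), the freezing step looks like this:

import numpy as np
import tensorflow as tf
from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2

# Load the model and pick the concrete function to freeze (path is a placeholder)
model = tf.saved_model.load("ssd_mobilenet_v2/saved_model")
concrete_func = model.signatures["serving_default"]

# Bake all variable values into the graph as constants
frozen_func = convert_variables_to_constants_v2(concrete_func)

# The frozen function is called just like the original one
image = tf.constant(np.random.randint(0, 255, size=(1, 320, 320, 3), dtype=np.uint8))
detections = frozen_func(image)

If you also need a standalone .pb file, frozen_func.graph.as_graph_def() can be written out with tf.io.write_graph.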
Have a look at the improvement below! We went from ~80 to ~45 milliseconds. That's roughly a 44% reduction in latency, nearly twice as fast, just from converting variables to constants.
Solution 2: Performing Graph Optimizations
Beyond this simple step of converting variables to constants, another useful step is to perform optimizations on the model graph itself. TensorFlow ships with a set of graph optimizers (https://www.tensorflow.org/guide/graph_optimization) that improve the graph across several metrics, such as time taken, memory consumed, and power consumed.
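One way to experiment with these optimizers is through tf.config, which lets you toggle individual Grappler passes globally. Which combination actually helps depends on your model, so treat the snippet below as a sketch rather than a recommended configuration:

import tensorflow as tf

# Toggle individual Grappler passes globally (which ones help is model-dependent)
tf.config.optimizer.set_experimental_options({
    "constant_folding": True,        # pre-compute subgraphs that depend only on constants
    "arithmetic_optimization": True, # simplify and deduplicate arithmetic expressions
    "remapping": True,               # fuse common patterns such as Conv2D + BiasAdd + activation
    "dependency_optimization": True, # strip redundant control dependencies and Identity-style no-ops
})
print(tf.config.optimizer.get_experimental_options())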
There are also some application-specific optimizations. If you want to drastically reduce your model's memory footprint at the cost of a slight reduction in accuracy, you might want to look into quantization (https://www.tensorflow.org/lite/performance/post_training_quantization). However, if you are using your model in a setting where accuracy matters more than time or memory constraints, such as a Kaggle data science competition, then quantization would not be beneficial.
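When quantization does fit your use case, post-training dynamic-range quantization takes only a few lines with the TFLite converter. The SavedModel path below is a placeholder, and full-integer quantization would additionally require a representative dataset:

import tensorflow as tf

# Convert a SavedModel to TFLite with dynamic-range quantization (path is a placeholder)
converter = tf.lite.TFLiteConverter.from_saved_model("ssd_mobilenet_v2/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # store weights as 8-bit integers
tflite_model = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)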
It's important to analyze which types of optimization are in your model's best interest: accuracy? latency? memory consumption? power consumption? (The last two are especially important for low-memory devices like microcontrollers.)
In fact, if you are targeting your machine learning model at microcontrollers, you might want to look into a compiler built for low-power, low-memory devices, such as DeepC (https://github.com/ai-techsystems/deepC).
Solution 3: Using the C API (or a C++ Wrapper)
One shortcoming of Python is that it's an interpreted language: no intermediate executable is generated; instead, all the code is interpreted on the fly. For this reason, Python programs tend to be slow and the interpreter has a high memory consumption.
If you're looking to interface your C++ project with a machine learning model and process the output directly as vectors, taking advantage of TensorFlow's C API may be the right choice (https://www.tensorflow.org/install/lang_c).
For C++ projects specifically, Cppflow (https://github.com/serizba/cppflow) is a simple and elegant solution that provides a C++ interface over the C API. Have a look at how easy it is to create a tensor, load in the data, and run the model.
auto x = new Tensor(model, "x"); // create a new input tensor
x->set_data(input_data, {1, 224, 224, 3}); // load the input data into the tensor
model.run({x}, {num_detections, detection_boxes, detection_classes, detection_scores}); // run the model
Let's have a quick look at the performance of this model.
Adjusting for the speed difference between Google Colab and my local machine, the Cppflow C++ implementation actually runs about 3x slower than the Python implementation, which is important to be aware of.
To recap, it's important to optimize your deep learning model before applying it in production; training is not the last step. Graph optimizations (with freezing variables to constants being the simplest of them) let you target your model to the appropriate production setting, which depends on the scenario in which it is used and the hardware it runs on.
All of the source code mentioned in this article is available at https://github.com/SRavit1/BoostingModelSpeed. Thanks for reading!