What is needed to run Machine Learning at Scale?
Niwratti Kasture
Artificial Intelligence & Automation Practice Leader | Professional Services | Digital Transformation
Deploying machine learning models at scale is one of the most pressing challenges faced by teams of ML experts, and this problem becomes even harder to crack as model complexity increases. The central issue in all of machine learning is: "how do we extrapolate learnings from a finite amount of available data to all possible inputs of the same kind?" Training data is always finite, yet the model is supposed to learn everything about the task at hand from it and perform well on unseen data from the real world.
Machine learning as a field is evolving through experiments, research and trials on a variety of data sets, leading to new techniques and models being built. The required number of trainable parameters, hyper-parameters, layers in deep learning models etc. keeps increasing, and as a result the underlying data, resource and infrastructure requirements are growing proportionately.
A sample comparison of DL models (CNNs - convolutional neural networks) from the ImageNet challenge shows a sharp decrease in error rate (28% to 3.5% between 2010 and 2015) with advancements in CNN architectures; at the same time, model complexity also increased significantly.
The above depiction is just a sample comparison of CNN models; a similar situation has occurred across other model families, where model complexity increases to solve more complex and custom problems. This also means an increase in the effort, time and cost of building models. However, it is important to remember that model building is just one of the many steps in the entire ML pipeline.
For large-scale deployment, the scope widens and each individual stage of the ML project life cycle has to be carefully executed. These aspects are not limited to the tools used in these stages but also include the methods applied at individual stages such as feature engineering, training, performance evaluation, and launching a prediction model. Some of the key aspects to focus on are -
1. Model selection - Artificial intelligence as a field offers several models (in machine learning and deep learning) to solve the same problem. Which one should you select to solve the problem at hand in the most optimal way? The optimal solution is a function of multiple factors, e.g. cost, time, resources, data, model complexity etc. While selecting the optimal prediction model, there are many factors that have to be considered.
1.1 Simplicity - If you are in a dilemma, always choose the simpler model. But how do you decide which model is the simplest of all the available options? Here are some tips.
A simple model:
- is usually more generic than a complex model. This becomes important because generic models are bound to perform better on unseen datasets.
- requires fewer training data points. This is a crucial factor for problem statements where data points are limited, for example X-rays for a specific disease.
- is more robust and does not change significantly if the training data points undergo small changes.
- may make more errors in the training phase but it is bound to outperform complex models when it sees new data.
1.2 Selection of the optimal model for the problem at hand - There are many ways (models) to solve the same problem. For example, in the case of a classification problem, you have a variety of options such as Logistic Regression, Naive Bayes, Decision Trees, XGBoost etc., and each of them has unique capabilities. You need to find the most optimal model that meets the performance requirement with comparatively less effort in terms of hyper-parameter tuning, execution time, performance outcome etc. (a small comparison sketch is shown below).
Sometimes what a sword fails to do, a needle can.
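As a minimal sketch of such a comparison, the candidate models can be put through the same cross-validation loop on the same data (a synthetic data set is used here for illustration; XGBoost is left out to keep the dependencies small, but it can be dropped into the same loop):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic classification data purely for illustration
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)

candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Naive Bayes': GaussianNB(),
    'Decision Tree': DecisionTreeClassifier(max_depth=5),
    'Gradient Boosting': GradientBoostingClassifier(),
}

# 5-fold cross-validation gives a like-for-like view of predictive performance;
# training time, tuning effort and interpretability still have to be weighed separately
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name:20s} mean accuracy = {scores.mean():.3f}")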
1.3 Flexibility to adopt new techniques - Machine learning model development goes through an iterative process with several experiments. It is necessary to ensure that the selected model is flexible enough to accommodate new techniques for any unique requirements. For example, in deep learning, the concept of transfer learning can be applied with a variety of available pre-trained models - https://keras.io/applications/. It is important to note that each of these models has a different model size, number of trainable parameters etc. Through experiments you will be able to find the optimal model for your specific use case. For example, VGGNet-16 or -19 is very heavy in terms of model size and number of parameters compared to MobileNet, so it may not be suitable to use VGGNet in every scenario even though it is one of the most commonly used neural network architectures.
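As a minimal sketch (assuming TensorFlow 2.x's bundled Keras; exact parameter counts depend on the Keras version and chosen input size), the size difference between VGG16 and MobileNet can be checked directly before committing to one as a transfer learning backbone:

from tensorflow.keras.applications import VGG16, MobileNet

# Both models are loaded with ImageNet weights and without the classification head,
# which is the usual starting point for transfer learning
for build in (VGG16, MobileNet):
    base = build(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
    print(f"{base.name}: {base.count_params():,} parameters")
    # base.trainable = False  # freeze the backbone and train only a new head on top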
2. Training speed and managing large data sets - It is a known fact that training time for a model is a function of the training data supplied and the capability of the underlying infrastructure. It therefore becomes important to plan data ingestion in a way that neither overloads nor under-utilizes the GPU or CPU (whichever is used). There are utilities such as the Keras built-in generator (or a custom-built generator) which feed the data in mini-batches to the model to maintain a continuous supply of data, for example the ImageDataGenerator below:
keras.preprocessing.image.ImageDataGenerator(
    featurewise_center=False, samplewise_center=False,
    featurewise_std_normalization=False, samplewise_std_normalization=False,
    zca_whitening=False, zca_epsilon=1e-06,
    rotation_range=0, width_shift_range=0.0, height_shift_range=0.0,
    brightness_range=None, shear_range=0.0, zoom_range=0.0,
    channel_shift_range=0.0, fill_mode='nearest', cval=0.0,
    horizontal_flip=False, vertical_flip=False, rescale=None,
    preprocessing_function=None, data_format='channels_last',
    validation_split=0.0, interpolation_order=1, dtype='float32')
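As a minimal usage sketch (assuming TensorFlow 2.x's bundled Keras and a hypothetical data/train folder with one sub-folder per class), the generator streams augmented mini-batches to the model so that the full data set never has to sit in memory at once:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rescale pixels, add light augmentation and reserve 20% of the images for validation
datagen = ImageDataGenerator(rescale=1./255, rotation_range=15,
                             horizontal_flip=True, validation_split=0.2)

# 'data/train' is a hypothetical directory organised as data/train/<class_name>/*.jpg
train_gen = datagen.flow_from_directory('data/train', target_size=(224, 224),
                                        batch_size=32, class_mode='categorical',
                                        subset='training')
val_gen = datagen.flow_from_directory('data/train', target_size=(224, 224),
                                      batch_size=32, class_mode='categorical',
                                      subset='validation')

# model.fit consumes batches directly from the generators:
# model.fit(train_gen, validation_data=val_gen, epochs=10)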
3. Feature Engineering - A vast topic in itself. Feature engineering is an important iterative process as part of model building. There are several situations when you don't have enough data related to the problem statement, or the available data is noisy, unstructured or has anomalies. In such cases it becomes important to prepare the data through various techniques such as data augmentation, data transformation through variable conversion, deriving new variables, modifying data formats, scaling, binning, grouping, one-hot encoding etc.
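As a minimal sketch of a few of these techniques (scaling, binning and one-hot encoding) on a small, made-up customer table:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data purely for illustration
df = pd.DataFrame({
    'age': [22, 35, 58, 41],
    'income': [28000, 52000, 94000, 61000],
    'city': ['Pune', 'Mumbai', 'Pune', 'Delhi'],
})

# Scaling: bring numeric columns to zero mean / unit variance
df[['age_scaled', 'income_scaled']] = StandardScaler().fit_transform(df[['age', 'income']])

# Binning: convert a continuous variable into coarse groups
df['age_band'] = pd.cut(df['age'], bins=[0, 30, 50, 100], labels=['young', 'middle', 'senior'])

# One-hot encoding: expand a categorical column into indicator columns
df = pd.get_dummies(df, columns=['city'])

print(df.head())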
4. Ease in terms of training, debugging, evaluation and deployment - Model building is just one module in the entire machine learning pipeline. The entire cycle requires a supporting eco-system for training, debugging, log review and monitoring of the model. For huge data sets, it has become a natural choice to execute the entire machine learning cycle on cloud platforms such as AWS SageMaker and its associated services, which provide the required eco-system (GPUs, log tracking, alerting, model performance monitoring etc.).
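As a rough sketch of what launching a managed training job can look like with the SageMaker Python SDK (the train.py script, the IAM role and the S3 path below are placeholders, and instance types and framework versions will depend on your account and use case):

import sagemaker
from sagemaker.tensorflow import TensorFlow

role = 'arn:aws:iam::123456789012:role/MySageMakerRole'   # placeholder IAM role

# Training code lives in train.py; SageMaker provisions the GPU instance,
# streams logs to CloudWatch and tears the instance down when the job finishes
estimator = TensorFlow(
    entry_point='train.py',          # placeholder training script
    role=role,
    instance_count=1,
    instance_type='ml.p3.2xlarge',   # single-GPU instance, adjust as needed
    framework_version='2.4',
    py_version='py37',
)

estimator.fit({'training': 's3://my-bucket/training-data/'})   # placeholder S3 path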
5. Curse of over-fitting - For large-scale deployments, it is necessary to ensure that your model is not cursed with the problem of over-fitting. This is a phenomenon where a model becomes too specific to the data it is trained on and fails to generalize to unseen data points in the real world: the model has 'learnt' not just the hidden patterns in the data but also the noise and inconsistencies in it. To identify whether your model suffers from over-fitting, compare its performance on training and test data. An over-fitted model will fail miserably on the test data and show a sharp difference in results between the training and test sets.
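A minimal sketch of this check, using a deliberately unconstrained decision tree on a synthetic, noisy data set (the 10% gap threshold is an illustrative choice, not a standard):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# An unconstrained tree can memorise the training data, including its noise
model = DecisionTreeClassifier(max_depth=None, random_state=42).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")

# A large gap between the two numbers is the classic signature of over-fitting;
# regularisation (e.g. limiting max_depth) usually narrows it
if train_acc - test_acc > 0.10:
    print("Warning: the model looks over-fitted")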
6. Model performance measures - Each ML model is designed to solve a problem or meet a specific goal, so it is obvious that model performance needs to be measured with relevant KPIs. Generally, at the beginning of model building the focus is on having the best model (in terms of prediction capability or accuracy), and this often outweighs any other KPI considerations. After all, among the many choices of machine learning and deep learning models (for example decision trees, random forests, XGBoost, ANNs, CNNs and many more), the idea is to select the best one that satisfies the prediction requirements. However, as the model matures, the ML framework's ease of use, scalability, cost of model deployment and extensibility become much more important performance parameters. This is a shift in terms of selecting relevant KPIs during the model deployment and usage cycle.
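As a minimal sketch, the prediction-quality KPIs for a binary classifier can be computed in a few lines (a quick logistic regression on synthetic data stands in for your own model here):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]   # probability of the positive class

# Prediction-quality KPIs; which of these matters most depends on the business goal
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1 score :", f1_score(y_test, y_pred))
print("roc auc  :", roc_auc_score(y_test, y_prob))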
7. Business performance measures - As part of an ML project, model performance must be evaluated with relevant measures, as explained above. In addition, it is also important to validate the model against the expected business outcome. It happens quite often that the model-building team comes up with a fantastic model with high performance results that still doesn't meet the business demand. To avoid such cases, relevant methods and measures should be used to validate the model's performance against the business requirement. For example, e-commerce sites and social media platforms (Facebook, Twitter) use A/B testing as a method to capture the relevant metrics. Twitter uses A/B testing outcomes to determine how a new model helps improve usage, the time users stay on the site etc. Digital experience optimization platforms such as VWO use a similar approach to help e-commerce portals place product ads at the most relevant spots on the page to improve sales.
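A minimal sketch of evaluating such an A/B test with a two-proportion z-test (the conversion counts below are made up, and statsmodels is assumed to be available):

from statsmodels.stats.proportion import proportions_ztest

# Made-up numbers: conversions and visitors for the old model (A) and the new model (B)
conversions = [480, 540]
visitors = [10000, 10000]

# One-sided test: is B's conversion rate significantly higher than A's?
stat, p_value = proportions_ztest(count=conversions, nobs=visitors, alternative='smaller')
print(f"conversion A: {conversions[0]/visitors[0]:.2%}, "
      f"conversion B: {conversions[1]/visitors[1]:.2%}, p-value: {p_value:.4f}")

# A small p-value suggests the new model genuinely moves the business metric,
# not just the offline ML metrics
if p_value < 0.05:
    print("Roll out the new model")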
8. Quality and speed of predictions - In retail, e-commerce, web search, social media etc., where the speed and quality of content is pivotal, it becomes necessary for the machine learning model to provide fast and accurate predictions after deployment. With expectations of real-time responses on mobile applications, outcome speed and scalability have to be considered while dealing with large data sets. To achieve high accuracy and speed of execution, the underlying infrastructure and resources need to be tuned and allocated optimally (with auto scaling, the required GPU configuration etc.).
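A minimal sketch of measuring per-request prediction latency for a fitted model (the model, data and the 100 ms budget mentioned in the comment are illustrative assumptions):

import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Simulate online serving: one request at a time, as a real-time API would see it
latencies = []
for row in X[:500]:
    start = time.perf_counter()
    model.predict(row.reshape(1, -1))
    latencies.append((time.perf_counter() - start) * 1000)   # milliseconds

print(f"median latency: {np.median(latencies):.2f} ms, "
      f"p95 latency: {np.percentile(latencies, 95):.2f} ms")

# If the p95 latency exceeds the product's real-time budget (say 100 ms), consider
# request batching, a lighter model, distillation or better hardware before scaling out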
9. Maintainability - Post-deployment maintenance is a key factor when deciding the best model for deployment. Maintaining a model after deployment requires performance monitoring, availability management, tracking of model outcome logs etc. There are real skill constraints within organizations, where the functioning and characteristics of a model are known only to a limited set of people. That is not a good situation to be in if you are planning to deploy machine learning at scale, and it calls for up-skilling.
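A minimal sketch of the kind of structured outcome logging that makes post-deployment monitoring possible (the log format, file name and field names are illustrative choices, not a standard):

import json
import logging
from datetime import datetime, timezone

logging.basicConfig(filename='predictions.log', level=logging.INFO, format='%(message)s')

def log_prediction(model_version, features, prediction, probability):
    """Append one structured record per prediction so that accuracy and
    drift can be audited later, once the true outcomes are known."""
    logging.info(json.dumps({
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'model_version': model_version,
        'features': features,
        'prediction': int(prediction),
        'probability': float(probability),
    }))

# Example call inside a prediction service (made-up feature values)
log_prediction('churn-model-v3', {'age': 41, 'tenure_months': 18}, 1, 0.83)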
In a nutshell, the machine learning cycle from problem statement definition to model deployment and maintenance involves multiple elements, and you need to tune the knobs of these various components to get an optimal model that runs at scale.