How to Deploy a Machine Learning model to production using Microsoft Azure
In this post, let's discuss how to deploy a machine learning model to production using Azure Machine Learning Studio. You don't need an Azure subscription unless you want to store lots of data (>10GB); for huge datasets, you might need to choose the paid standard plan. We will be testing our web service using the free tier; however, the production web API endpoint is only available on a paid plan. In this example, we will create a model from scratch, train it, and deploy it to production. So, let's begin.
- The first step is to register for an account at Azure Machine Learning Studio. Choose the "Free" plan and register for a new account. To learn about "Free" account restrictions, read here. Navigate to the Azure Studio home page.
- Create a "New Blank Experiment". I have already created an experiment, so the view panel may seem different when you do it for the first time.
Once you have created it, navigate to "Settings" on the left side panel and change your workspace name if you want.
To demonstrate this, I have used the "Titanic survival" sample dataset. We will discuss a binary classification example here. You can use a different dataset of your own and follow the instructions in this post.
3. Now we can start with data pre-processing. We will need to analyze the data in order to understand the categorical/non-categorical variables, data types, missing values, and data distributions. We would have more steps to follow if we had huge data with lots of noise and imbalance; note that we're using a pretty small dataset for simple illustration. Let's get started by uploading the dataset to Azure Studio. Click on "Dataset" on the left side panel and then click on "New".
Upload the dataset from your drive and go back to the experiment. Go to Saved Datasets -> My Datasets, choose your dataset, and drag and drop it into your experiment window.
If your data is on the cloud, then go to Data Input and Output -> Import Data and fill up the details.
4. Now, let's analyze the data before proceeding any further, looking for noise and anomalies present in the data. We need to remove or adjust them so that we get optimal model performance. Right-click on the dataset and click on "Visualize".
As you can see, there are 1309 data records with 12 features. Let's perform a deeper analysis of this data using statistical operations. Drag and drop Statistical Functions -> Summarize Data, connect the endpoints, and then click Run to execute your experiment.
Once your experiment has executed, visualize the data as shown below. The data visualization won't be available until the experiment has been executed.
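If you want to reproduce this summary step outside of Azure Studio, here is a minimal pandas sketch; the file name titanic.csv is an assumption, and the column names follow the dataset used above.

```python
import pandas as pd

# Load the Titanic dataset locally (file name is an assumption; adjust the path)
df = pd.read_csv("titanic.csv")

# Number of records and features (1309 x 12 for this dataset)
print(df.shape)

# Column types and non-null counts, similar to what Summarize Data reports
df.info()

# Per-column statistics, including unique counts for text columns
print(df.describe(include="all"))

# Missing values per column
print(df.isnull().sum())
```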
Data visualization in Azure is very powerful. You may be deploying your app to another platform, but at least you can make use of this data visualization and save time writing pre-processing code in your application. You will have some observations once you analyze the data, and we will perform data pre-processing based on them.
Here are the observations as per the visualization:
- Columns that are categorical: pclass, survived, sex, sibsp, parch, embarked
- Columns that are non-categorical and to be removed: PassengerId, name, ticket, fare
- Columns with missing values: age (263), fare (1), cabin (1014), embarked (2)
There are 1014 missing cabin values in the dataset. Considering the dataset size, we can observe that the majority of the samples don't have this feature, so we can remove the cabin feature as well. We can also observe that the cabin feature has multiple redundant entries. For the missing age values, we can fill the records with the mean. Now let's apply the above-mentioned changes to our dataset. You can remove the Summarize Data module from the experiment.
5. Add the Edit Metadata module to your experiment and launch the column selector.
We have identified categorical variables from our earlier analysis. Include them here.
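For reference, the pandas equivalent of this Edit Metadata step (continuing the df from the earlier sketch) would look roughly like this:

```python
# Mark the identified columns as categorical, mirroring Edit Metadata
# (column names are as they appear in this dataset; adjust case if needed)
categorical_cols = ["pclass", "survived", "sex", "sibsp", "parch", "embarked"]
for col in categorical_cols:
    df[col] = df[col].astype("category")
```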
Now we need to remove the unwanted non-categorical variables and the variables that are mostly redundant. We found earlier that PassengerId, name, ticket, fare, and cabin can be removed. So, let's remove them by pulling another module, Select Columns in Dataset, into our experiment window and launching the column selector to specify the variables to be removed.
Add the Clean Missing Data module and specify the features to be handled. On the right-side panel, you can specify how you want to replace the missing values and select the variables in the column selector.
For the variable embarked, we just need to remove the rows with the respective missing values.
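Outside the Studio, the same removal and cleaning steps could be sketched in pandas like this (continuing the df from the earlier snippets; column names and cases are assumptions based on the dataset):

```python
# Drop the non-categorical / redundant columns identified earlier
df = df.drop(columns=["PassengerId", "name", "ticket", "fare", "cabin"])

# Replace missing age values with the column mean
df["age"] = df["age"].fillna(df["age"].mean())

# Drop the two rows where embarked is missing
df = df.dropna(subset=["embarked"])
```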
Now we need to specify the output variable(s) in the dataset.
Although in this example the variable magnitudes don't vary widely, let's go ahead and normalize the data for better consistency. If variable magnitudes vary widely from each other, then normalization is a must; otherwise, the learning algorithm will give more weight to the variables with higher magnitudes. It might also result in over-fitting. Let's do a min-max normalization on the data as below.
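Min-max normalization rescales each value with x' = (x - min) / (max - min), mapping the column into [0, 1]. A minimal scikit-learn sketch, assuming age is the only remaining numeric column after the cleaning above:

```python
from sklearn.preprocessing import MinMaxScaler

# Rescale numeric columns to [0, 1]; age is assumed to be the only
# remaining numeric (non-categorical) column after cleaning
scaler = MinMaxScaler()
df[["age"]] = scaler.fit_transform(df[["age"]])
```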
Now let's split the data for training and validation. Drag and drop the Split Data module, then specify the split ratio and other parameters as required.
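In code, the equivalent step could look like this; the 70/30 ratio is an illustrative assumption (use whatever ratio you set in the Split Data module), and one-hot encoding is added here because the scikit-learn model used later needs numeric inputs:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# One-hot encode the categorical features so they are numeric
X = pd.get_dummies(df.drop(columns=["survived"]), drop_first=True)
y = df["survived"].astype(int)

# 70/30 train/validation split (ratio is an illustrative choice)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42)
```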
Add the Train Model module and specify the label column. Also, select the classification algorithm of your choice and add it to your experiment. In this example, I have selected a classic feed-forward network model and connected it to the Train Model module in the experiment.
Specify your neural network parameters, such as the number of hidden nodes, learning rate, initial weights, number of epochs, etc. By default, it assumes a fully connected layer, but you can also provide your own hidden layer specification by adding a custom script to the experiment.
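As a rough stand-in for the Studio's fully connected network, here is a scikit-learn MLPClassifier sketch; the hyperparameters mirror the kind of settings mentioned above but are illustrative choices, not the Azure module's actual defaults:

```python
from sklearn.neural_network import MLPClassifier

# A small fully connected feed-forward network; hidden node count,
# learning rate, and epoch count are illustrative values
model = MLPClassifier(hidden_layer_sizes=(100,),
                      learning_rate_init=0.1,
                      max_iter=100,
                      random_state=42)
model.fit(X_train, y_train)
```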
Now add the Score Model module and connect it to both the Train Model module and the Split Data module. This is because we split the data earlier, and the remaining (validation) portion is needed to check the model's efficiency. Click Run to train the model and to visualize the validation results.
From the visualization, we can see how well we are predicting survival. The prediction values indicate the probability of class 1 (survived). For example, the first score says the person has a survival chance of 11.57%, while the fourth score says the person has a survival chance of 80.48%. Conventionally, we can assume that all probabilities above 50% indicate survival.
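With the stand-in scikit-learn model above, the same scoring and 50% threshold would look like:

```python
# Probability of class 1 (survived) for each validation sample,
# mirroring the scored probabilities in the Score Model output
proba = model.predict_proba(X_val)[:, 1]

# Apply the conventional 50% threshold to get hard predictions
predictions = (proba > 0.5).astype(int)
print(proba[:5])
print(predictions[:5])
```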
We can also see the model accuracy by adding an Evaluate Model module, running the experiment, and visualizing the results.
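The corresponding metrics in scikit-learn, for comparison with the Evaluate Model visualization:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Accuracy and confusion matrix on the validation split
print("Accuracy:", accuracy_score(y_val, predictions))
print(confusion_matrix(y_val, predictions))
```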
Now that we are happy with our model's confusion matrix and accuracy metrics, how do we deploy it as a web service to production? Drag and drop the web service modules and connect the input/output to the respective modules. We need to add a handle for servicing inputs from the web service endpoint. For that, add a Select Columns in Dataset module to intercept the inputs from the web service and filter out the output label if present.
Now run this experiment and click on Deploy Web Service from the options listed at the bottom. Click on Test in the following window.
Enter your web service input and submit it.
You can view the web service response at the bottom of the window: a 12.5% chance of survival.
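To consume the deployed endpoint from an application, Azure ML Studio generates sample consumption code on the web service page. Here is a minimal Python sketch along those lines, where the URL, API key, and input values are placeholders you replace with your own:

```python
import requests

# Placeholders: copy the real URL and API key from your web service dashboard
url = ("https://<region>.services.azureml.net/workspaces/<workspace-id>"
       "/services/<service-id>/execute?api-version=2.0&details=true")
api_key = "<your-api-key>"

# Request body in the classic Studio request/response format;
# column names and values here are illustrative
body = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["pclass", "sex", "age", "sibsp", "parch", "embarked"],
            "Values": [["3", "male", "22", "1", "0", "S"]],
        }
    },
    "GlobalParameters": {},
}

headers = {"Content-Type": "application/json",
           "Authorization": "Bearer " + api_key}

response = requests.post(url, json=body, headers=headers)
print(response.json())  # contains the scored label and survival probability
```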
Congrats! Finally, we have deployed a machine learning model to production, and we can re-use its endpoint in our applications to fetch prediction results! Thank you for taking the time to read through this post. If you ever wonder how complex it is to build a similar application in Java, you can read here about how to build one using DL4J and achieve good model performance.