Lessons learnt while building a Machine Learning Pipeline

After a couple of weeks of hard work and a tug of war with the data, I told myself, "Ok, we now have a decent accuracy of over 95% from this tuned model. We are all set to take this model live and derive real business value!"

The next day the meeting happened, and I got the reply: "Ok Mohit, that's cool work, but we need to be able to integrate it with our product so that users can benefit from it."

The next food for thought was: "How do I turn my Jupyter notebook/prototype into a scalable and modular application that can be easily integrated with the existing setup of applications?"

The above situation is extremely common, especially where lots of applications (or services) interact with each other to form a platform/product. There are many architectural decisions to be taken, many of which depend on the existing setup, flow of data, hardware, current technology stack, etc. Often these decisions are taken with the help of software architects, and our role as Data Scientists/ML engineers is crucial in them. So it's better if we have at least some idea of the possible approaches.

The following article presents some of my learnings from building an ML pipeline.

Journey from an MVP to a Product

Solving a Data Science problem is like going on an adventure. We usually have to do a lot of exploration and experimentation before arriving at a final solution. What matters here is that we try a lot of ideas and fail fast! The journey from an experiment to a product is very well explained in the talk below by Dat Tran and his team at Idealo.


Let's start with the basics: what can the architecture of the application look like?

1. Naive Approach - Hey, you know what, Jupyter Notebook has a functionality where you can just export the notebook into a Python file. Then I will just run the Python file using the Python executable. Not a wise decision! Why, you might ask?

  • The code is not modular/scalable
  • If you wish to change the features or tune any hyper-parameters, you need to go back and change the code
  • Training a model is a different activity from serving users with predictions, so it doesn't make sense to repeat the entire activity just to get new predictions

2. Model weights save approach - Another approach is to save the weights of the trained model into a DB/file/data store and then load the coefficients/weights back up to predict.

# Assuming X_train, X_test, Y_train, Y_test are already prepared
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, Y_train)
print(clf.score(X_test, Y_test))

# Save the learned coefficients to a JSON file
import json
with open('logreg_coefs', 'w') as f:
    json.dump(clf.coef_.tolist(), f)

Here we save the coefficients to a JSON file, which we can then load separately in the prediction stage to predict. A minimal sketch of that prediction side is shown below.
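Note that the snippet above persists only clf.coef_; a real setup would also need to store clf.intercept_ and the class labels. Under that assumption (with an extra, hypothetical logreg_intercept file), the prediction side might look roughly like this:

# Prediction side (sketch): load the saved coefficients and score new data.
# Assumes a binary classifier and that the intercept was saved too;
# 'logreg_coefs' and 'logreg_intercept' are illustrative file names.
import json
import numpy as np

with open('logreg_coefs') as f:
    coefs = np.array(json.load(f))       # shape (1, n_features)
with open('logreg_intercept') as f:
    intercept = np.array(json.load(f))   # shape (1,)

def predict_proba(X):
    # Recompute the logistic regression decision function by hand
    z = X @ coefs.T + intercept
    return 1.0 / (1.0 + np.exp(-z))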

  • In that case the training activity and the prediction activity are separated, and the two can run on separate servers. In short, training can be decoupled from prediction with minimal inter-dependency (oh, I said minimal, not no dependency!)
  • However, for most Machine Learning problems you will solve, you need to do feature engineering, clean the data, and standardize it. Usually this needs to happen in both the training and the prediction phase, which means redundant code on both machines/servers. Also, don't forget the release-management overhead even if you just modify/add/drop one feature
  • Also, this method won't be so simple to implement as you move towards complex models like tree-based ensembles. Yes, I am talking about the secret sauces we use to boost accuracy (Random Forests, XGBoost, etc.). Since their weights cannot be stored with a single line of code and the model can't be reconstructed easily, this method won't work there.

3. PMML approach - So we need a method that helps us describe & export the pre-processing steps along with the model. The PMML standard provides a way to represent a model as a PMML file. There is a wrapper library built around sklearn (sklearn2pmml) that can easily convert a sklearn model into a PMML file, which can later be used in downstream processes (a sketch of the consuming side follows the pros/cons below).

import pandas
from sklearn.tree import DecisionTreeClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

# Load the Iris dataset
iris_df = pandas.read_csv("Iris.csv")

# Wrap the estimator in a PMMLPipeline so that pre-processing
# steps and the model get exported together
iris_pipeline = PMMLPipeline([
    ("classifier", DecisionTreeClassifier())
])
iris_pipeline.fit(iris_df[iris_df.columns.difference(["Species"])], iris_df["Species"])

# Export the fitted pipeline to a PMML file
sklearn2pmml(iris_pipeline, "DecisionTreeIris.pmml", with_repr = True)

  • This saves the transformations along with the model information and hence avoids redundant code/operations
  • However, not all of the latest/complex models and transformations are supported by PMML, so for those we need a custom approach
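For the downstream side, here is a rough sketch of consuming the exported file, assuming the pypmml package (not used in the original) and feature names that match the training CSV's columns (the names below are illustrative):

# Downstream consumer (sketch): load the PMML file and score one record.
# Assumes pypmml is installed (pip install pypmml); feature names must
# match the columns of the Iris.csv used at training time.
from pypmml import Model

model = Model.fromFile("DecisionTreeIris.pmml")
print(model.predict({"SepalLengthCm": 5.1, "SepalWidthCm": 3.5,
                     "PetalLengthCm": 1.4, "PetalWidthCm": 0.2}))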

4. Custom Framework - In an ideal scenario we want a flexible custom approach: if we just provide the input, it should give us the prediction output. That is where serialization can help. Ideally we serialize the model together with the data-preparation steps and treat the result as a black box in the prediction layer. For this we can design the framework as two separate processes (services): training and prediction. Both processes should be able to run independently, and the end goal should be a black box that accepts input and gives predicted output. A sketch of this idea follows.
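As a minimal sketch of that idea, assuming scikit-learn's Pipeline and joblib (the step choices, variable names, and file name are placeholders), the two services could look like:

# Training service: serialize the data preparation and the model as one artifact.
# Assumes X_train, Y_train and X_new are already prepared.
import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ("scaler", StandardScaler()),          # data preparation step
    ("model", RandomForestClassifier()),   # any complex model works here
])
pipeline.fit(X_train, Y_train)
joblib.dump(pipeline, "model.joblib")      # one black-box artifact

# Prediction service: load the black box and predict, with no knowledge
# of the feature engineering baked inside it.
pipeline = joblib.load("model.joblib")
predictions = pipeline.predict(X_new)

The key design choice here is that the scaler's fitted statistics travel inside the same artifact as the model, so the prediction service never re-implements the feature preparation.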

There are lots of good project structures and existing artifacts for this. A couple of popular ones are:

  • Cookiecutter Data Science - a good starting point which specifies an entire data science project structure, especially useful when several Data Scientists/Engineers are working on different modules
  • Satalia Production Data Science - a nice tutorial containing a workflow for collaborative data science aimed at taking everything to production

I have also created a bare-bones repository taking a few of the ideas above together with my previous work experience. Below is a short diagram that describes the approach. It contains samples of various modules and can be extended as needed.

What is your experience taking the machine learning models you built live to users? Feel free to share your views and opinions.
