The Power of MLOps
Shubham Bhalala
Data Scientist @System2 | MS Data Science @Columbia University | Data Science, MLOps Engineer
What is MLOps?
In layman's terms, MLOps is what we get when we integrate the concepts of DevOps (continuous integration & continuous delivery) with machine learning, so that training a model and testing its accuracy are automated. What we will look at in this article is how you can create your machine learning model automatically. By automatic, I mean you don't have to manually adjust the hyper-parameters. What are hyper-parameters? Hyper-parameters are the variables we need to give to our program before it trains the model (a short sketch follows the list below). For example:
- Number of Hidden layers.
- Number of neurons in one particular layer.
- What optimizer to use?
- What learning rate to give?
- The kernel size.
- Which pooling method to use, and many more things like these!
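For instance, in Keras these hyper-parameters appear as plain variables in the training script. Here is a minimal sketch (the names and values are purely illustrative, not taken from the project code):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.optimizers import Adam

# Each hyper-parameter is just a variable we must pick before training
num_hidden_neurons = 512       # neurons in the hidden Dense layer
learning_rate      = 0.001     # step size for the optimizer
kernel_size        = (3, 3)    # size of each convolution filter
pool_size          = (2, 2)    # max-pooling window

model = Sequential()
model.add(Conv2D(32, kernel_size, activation='relu', input_shape=(64, 64, 3)))
model.add(MaxPooling2D(pool_size=pool_size))
model.add(Flatten())
model.add(Dense(num_hidden_neurons, activation='relu'))
model.add(Dense(10, activation='softmax'))

model.compile(optimizer=Adam(learning_rate=learning_rate),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

Every one of these values can change the final accuracy, and normally a human has to keep adjusting them by hand.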
It is a fact that more than 60% of machine learning projects in the corporate world never reach production; here is an article as proof: HERE. Now the question is: why is this so?
Because changing these hyper-parameters is a huge and tedious task. Training with one set of values might take days, and if at the end we don't get satisfying results, we change the hyper-parameters and train again, which takes a few more days. This process goes on until you are satisfied with the outcome. In the end, either it is too late for production or the accuracy doesn't match the requirement, and the project eventually gets shut down. This can mean a huge loss for a company and its reputation.
So, this article offers a solution, with one great example, tested by me in real life, of how you can automate all of this and still yield great accuracy!
Let's first discuss the requirements of this project. You can follow the same steps to build your own:
- Docker image for running CNN & ANN program.
- Docker image for running Regression & Classification program.
- Jenkins, Docker & Git-GitHub
So, let's talk about building the images we need, because we will be running our programs inside containers hosted by Docker. We will make one image using a Dockerfile, and pull the other directly from Docker Hub, modifying it according to our use case.
Docker image for Regression & Classification
FROM centos:latest
RUN yum install epel-release -y && \
    yum update -y && \
    yum install python36 -y && \
    pip3 install scikit-learn && \
    pip3 install numpy && \
    pip3 install pandas && \
    pip3 install matplotlib && \
    pip3 install pillow && \
    yum update -y
This Dockerfile will help you to create your image. Command to build it is:
docker build -t lrcpython .
This will create an image named lrcpython (short for linear regression & classification in Python). For this you should be in the directory where the Dockerfile was created; since I am in that same directory, I have used " . ". You will see the image come up. For a detailed description of Dockerfiles, feel free to view my previous articles.
Now, let's get one image for CNN & ANN; basically we need TensorFlow & Keras. To make our task easy and avoid using too many of our own resources, we will download it from Docker Hub. The image name is tensorflow/tensorflow. To download it, use the pull command:
docker pull tensorflow/tensorflow
A new image will be downloaded with TensorFlow support only; we need to install the other libraries manually.
You will see this image come up. Now launch one container from the image, install the libraries manually, and then commit it (or again use a Dockerfile to create the image). The simple way is to install manually, because this saves us time troubleshooting. To launch the container:
docker container run -it --name tensor tensorflow/tensorflow
This will land you inside the tensorflow container. Here using pip download the required modules.
pip install keras
pip install numpy
pip install pandas
pip install pillow
pip install scikit-learn
This will prepare your container for all of the CNN & ANN requirements.
Then commit this image:
docker container commit tensor tensorflow/tensorflow:v1
This will create one image with the name tensorflow/tensorflow:v1.
So our first requirement is finished: the images for training our ML models are ready. Now our main job starts, integrating this with Jenkins and Docker.
Before moving further: we will be using the concept of Transfer Learning, with MobileNet, to train our model on a monkey-breed dataset. I have provided the code on my GitHub; you can download it from there.
What is transfer learning?
In my previous article I described this topic in great detail, but I will brief you on it here.
Transfer Learning is a way to use the intelligence of a pre-trained model to train on a new dataset. It is helpful because it requires very few resources and can give high accuracy with a limited number of images. In this example you will see its power, and how it becomes even more robust when we integrate it with our DevOps concepts.
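For intuition, here is a minimal sketch of transfer learning with MobileNet in Keras. This is only the idea, not the exact code from my GitHub; the input shape and the class count (10 breeds) are placeholders:

from keras.applications import MobileNet
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model

# Load MobileNet pre-trained on ImageNet, without its classifier head
base = MobileNet(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# Freeze the pre-trained layers so only our new head gets trained
for layer in base.layers:
    layer.trainable = False

# Build a small new head on top of the frozen base
top_model = base.output
top_model = GlobalAveragePooling2D()(top_model)
top_model = Dense(1024, activation='relu')(top_model)
top_model = Dense(10, activation='softmax')(top_model)  # placeholder: 10 classes

model = Model(inputs=base.input, outputs=top_model)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

Because only the small head is trained while the MobileNet base stays frozen, training is fast even on modest hardware.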
Now the code and our Docker images are ready with us, so let's start integrating everything.
JOB1: download_classify
This job will download the code from GitHub and classify whether it is CNN, ANN, or Regression/Classification code; accordingly, it will copy it into a dedicated folder. Later this folder will be mounted as a volume into the container.
So, my file system looks like this:

/mlops_project/
└── download_classify/
    ├── ann/
    ├── cnn/
    └── lrc/
I already have some files here because I have tested my setup. This is just to understand the file system.
Now create one repository on GitHub and initialize it. If you don't know much about Git & GitHub, please visit my previous articles.
Here we will be uploading our ML code; from here everything will be automated and we will get our desired output.
This is how your repository looks; obviously the name will differ. Now let's create our job.
These are the configurations of JOB1, and this is the build script we used:
sudo cp -rf * /mlops_project/download_classify/
if sudo grep -r Conv2D *
then
    # Conv2D found -> CNN code
    fe=$(sudo grep -r Conv2D * | cut -d ":" -f 1)
    sudo cp -rf $fe /mlops_project/download_classify/cnn
    sudo docker container run -dit -v /mlops_project/download_classify/cnn:/root/ --name cnnmodel tensorflow/tensorflow:latest
elif sudo grep -r Dense * && ! sudo grep -r Conv2D *
then
    # Dense found and no Conv2D anywhere -> ANN code
    fe=$(sudo grep -r Dense * | cut -d ":" -f 1)
    sudo cp -rf $fe /mlops_project/download_classify/ann
    sudo docker container run -dit -v /mlops_project/download_classify/ann:/root/ --name annmodel tensorflow/tensorflow:latest
elif sudo grep -r sklearn *
then
    # sklearn found -> Regression/Classification code
    fe=$(sudo grep -r sklearn * | cut -d ":" -f 1)
    sudo cp -rf $fe /mlops_project/download_classify/lrc
    sudo docker container run -dit -v /mlops_project/download_classify/lrc:/root/ --name lrcmodel lrcpython
else
    echo "We don't support this model"
fi
Here I have used basic Linux commands to classify whether the code is ANN, CNN, or LRC (Linear Regression or Classification).
sudo grep -r Conv2D *
This will search for the Conv2D keyword in all files under the current directory (hence the -r flag, for recursive search).
sudo grep -r Conv2D * | cut -d ":" -f 1
Whatever result we get from grep -r Conv2D we pipe into cut -d ":" -f 1. What this does is: cut takes only the first field before the delimiter ":". Because grep gives output in the form filename:line-where-Conv2D-was-found, we take just the first field, the filename. A file containing Conv2D is classified as CNN; similarly, code containing Dense but not Conv2D is classified as ANN, and a file with the sklearn keyword is classified as Regression or Classification. Since ANN & CNN can be trained on the same image, we don't have to make a separate image for each of them; similarly for Regression or Classification.
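To make the rule fully explicit, here is the same classification logic written out in Python. This is just for clarity; the jobs themselves use the grep and cut commands shown above:

import os

def classify_model_file(path):
    # Apply the same keyword rule the Jenkins job applies via grep
    with open(path) as f:
        code = f.read()
    if 'Conv2D' in code:
        return 'cnn'     # convolutional layers -> CNN container
    elif 'Dense' in code:
        return 'ann'     # Dense without Conv2D -> ANN container
    elif 'sklearn' in code:
        return 'lrc'     # scikit-learn -> Regression/Classification container
    return None          # unsupported code

for fname in os.listdir('.'):
    if fname.endswith('.py'):
        print(fname, '->', classify_model_file(fname))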
Then we mounted the folder into which we copy our code inside the container, using this command:
sudo docker container run -dit -v /mlops_project/download_classify/cnn:/root/ --name cnnmodel tensorflow/tensorflow:latest
Here we have copied the file to the location /mlops_project/download_classify/cnn on our RedHat VM and mounted it to the /root directory of the container. Hence we have given the ML program file to the container; now we have to execute it. Internally, at the end of the code, we have to add a snippet (shown under JOB3 below) which gets the validation accuracy and checks whether it is less than 92%. If it is, our retrain job changes some hyper-parameters automatically and trains the code again. This cycle goes on until we get accuracy greater than 92%.
JOB2: exec
Here we execute the machine learning model code which we mounted inside the container's /root directory, using a normal docker exec command. Here are the configurations of JOB2.
Here is the build code.
if sudo docker ps | grep cnnmodel
then
    fe=$(sudo grep -r Conv2D * | cut -d ":" -f 1)
    sudo docker container exec cnnmodel python /root/$fe
elif sudo docker ps | grep annmodel
then
    fe=$(sudo grep -r Dense * | cut -d ":" -f 1)
    sudo docker container exec annmodel python /root/$fe
elif sudo docker ps | grep lrcmodel
then
    fe=$(sudo grep -r sklearn * | cut -d ":" -f 1)
    sudo docker container exec lrcmodel python /root/$fe
else
    echo "Something is wrong, check the containers"
fi
Here you can see that first we check which container is running, then we execute the code using docker container exec cnnmodel python /root/$fe, where $fe contains the Python file name. Similarly for all the models.
JOB3: retrain
Here what we want to achieve is: if the accuracy is not greater than 92%, we change some hyper-parameters and train again. For this we have one pre-requisite: our ML program should contain the code below.
final_accuracy = history.history["val_accuracy"][-1]
print(final_accuracy)
import os
if final_accuracy < 0.92:
    os.system("curl --user '<jenkins user>:<jenkins password>' https://192.168.99.102:8080/view/mlops/job/retrain/build?token=retrain")
else:
    print("Your new accuracy =", final_accuracy)
Here you can see we get the accuracy and compare it; if it is less than 92%, we trigger the retrain job using a token. For that we also have to generate the token. The trigger URL might differ between you and me, as I am doing it in a separate list view, which is why mine shows some extra path segments.
This is the job configuration.
In the build code I have used basic Linux operations to do this task: it inserts additional hidden (Dense) layers with the same configuration. Here is the code:
if sudo docker ps | grep cnnmodel
then
    # line number, line text and file name of the matched Dense layer
    ln=$(sudo grep -n -r "Dense(1024,activation='relu')" * | cut -d ":" -f 2)
    lt=$(sudo grep -n -r "Dense(1024,activation='relu')" * | cut -d ":" -f 3)
    fi=$(sudo grep -r "Dense(1024,activation='relu')" * | cut -d ":" -f 1)
    let "ln+=1"
    hp=" top_model = Dense(512,activation='relu')(top_model)"
    # insert the new 512-neuron layer, then re-insert the matched line below the original
    sudo sed -i -e $ln'i\\'"$hp" /mlops_project/download_classify/cnn/$fi
    sudo sed -i -e $ln'i'\\"$lt" /mlops_project/download_classify/cnn/$fi
    fe=$(sudo grep -r Conv2D * | cut -d ":" -f 1)
    sudo docker container exec cnnmodel python /root/$fe
elif sudo docker ps | grep annmodel
then
    ln=$(sudo grep -n -r "Dense(1024,activation='relu')" * | cut -d ":" -f 2)
    lt=$(sudo grep -n -r "Dense(1024,activation='relu')" * | cut -d ":" -f 3)
    fi=$(sudo grep -r "Dense(1024,activation='relu')" * | cut -d ":" -f 1)
    let "ln+=1"
    hp=" top_model = Dense(512,activation='relu')(top_model)"
    sudo sed -i -e $ln'i\\'"$hp" /mlops_project/download_classify/ann/$fi
    sudo sed -i -e $ln'i'\\"$lt" /mlops_project/download_classify/ann/$fi
    fe=$(sudo grep -r Dense * | cut -d ":" -f 1)
    sudo docker container exec annmodel python /root/$fe
else
    echo "Something is wrong with the code"
fi
The new thing I have used here is the sed command, which helps insert a line at a particular position. So here I am growing my hidden (Dense) layers. When I show you the results, you will see that with just a little tweak to the existing code we achieve accuracy that people don't normally get with the same code. I have tested and seen multiple versions of this code where people got around 92-93% validation accuracy. So let's see what our setup produces.
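To make the effect of those sed inserts concrete, here is what one retrain pass does to the head of the model, assuming the committed code contains exactly one matched line (this is a fragment of the model head, not a standalone script):

# Before a retrain pass, the head contains the line the job greps for:
#     top_model = Dense(1024,activation='relu')(top_model)
#
# After one retrain pass, the matched line has been duplicated (lt)
# and a new 512-neuron layer (hp) inserted below it:
top_model = Dense(1024,activation='relu')(top_model)   # original line
top_model = Dense(1024,activation='relu')(top_model)   # lt, re-inserted
top_model = Dense(512,activation='relu')(top_model)    # hp, the new layer

The leading space inside hp matters: it preserves the Python indentation so the modified file still runs.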
At last we have one monitoring job watching the running container. It checks every minute; if the container goes down, it restarts it, and the model is not lost because the code is written in such a way that it updates the weights and stores them in a file every time.
JOB4: monitor
There is nothing fancy in this; we could also use a Kubernetes cluster for deployment, which would make our work even simpler.
Here is the build code.
if sudo grep -r Conv2D /var/lib/jenkins/workspace/download_classify/*
then
    if sudo docker container ls | grep cnnmodel
    then
        echo "Alright"
    else
        sudo docker container run -dit -v /mlops_project/download_classify/cnn:/root/ --name cnnmodel tensorflow/tensorflow:latest
    fi
elif ! sudo grep -r Conv2D /var/lib/jenkins/workspace/download_classify/* && sudo grep -r Dense /var/lib/jenkins/workspace/download_classify/*
then
    if sudo docker container ls | grep annmodel
    then
        echo "Alright"
    else
        sudo docker container run -dit -v /mlops_project/download_classify/ann:/root/ --name annmodel tensorflow/tensorflow:latest
    fi
elif sudo grep -r sklearn /var/lib/jenkins/workspace/*
then
    if sudo docker container ls | grep lrcmodel
    then
        echo "Alright"
    else
        sudo docker container run -dit -v /mlops_project/download_classify/lrc:/root/ --name lrcmodel lrcpython
    fi
else
    echo "We don't recognize this coding language"
fi
Overall scenario of JOB scheduling
download_classify ---Downstream---> exec ---Trigger---> retrain ---Downstream---> monitor
Finally, I am very excited to show you the results, which people normally don't get on this dataset, and how I got that much accuracy with just one small tweak.
The accuracy of the default script that we find on the internet is:
Code:
Here you can see they have used three Dense layers: two with 1024 neurons and one with 512 neurons. We are going to tweak this using our MLOps power.
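Since that code appears only as a screenshot, here is roughly what the default head described above looks like (a sketch reconstructed from the description, assuming the MobileNet base from earlier and 10 monkey breeds):

from keras.applications import MobileNet
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model

base = MobileNet(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False

# The default head: three Dense layers, as in the internet tutorial code
top_model = base.output
top_model = GlobalAveragePooling2D()(top_model)
top_model = Dense(1024,activation='relu')(top_model)
top_model = Dense(1024,activation='relu')(top_model)
top_model = Dense(512,activation='relu')(top_model)
top_model = Dense(10, activation='softmax')(top_model)  # placeholder: 10 classes

model = Model(inputs=base.input, outputs=top_model)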
After training you will get some result. I am running this and showing you just to have a comparison with our MLOps setup. In my case, the first-time accuracy was around 94%.
Now these are some results of our creation. While uploading to GitHub I kept just one Dense layer by default.
And as soon as I commit it, my JOB1 starts:
Then JOB2 automatically starts and begins training our model. Here I have already provided the dataset using WinSCP, because a file this big can't be uploaded to GitHub.
You can see that after JOB1, JOB2 started and began training; you can see each epoch in detail in the Console Output.
So, with one Dense layer we achieved accuracy greater than 92%: around 94% with just one Dense layer. Now you have two options: either manually trigger the retrain job, or increase the threshold value from 92 to 95. Here I have triggered it manually once, to show you that it actually adds one Dense layer, with proper indentation, and runs successfully.
You can see at the end we have two Dense layers with 1024 neurons; previously we had only one. Now we have two! Though I triggered that manually. After this, I will change the threshold so that it automatically triggers the retrain, adds one Dense layer with 1024 neurons, and gives us the accuracy.
You can see here that the accuracy decreased, but it is still above 92%, and it even triggered the monitor job. Now let's change the threshold and see the results. We have also modified the requirement to compulsorily have one Dense layer with 512 neurons along with the one we add by default. So the changes look something like this:
if sudo docker ps | grep cnnmodel
then
    # line number, line text and file name of the matched Dense layer
    ln=$(sudo grep -n -r "Dense(1024,activation='relu')" * | cut -d ":" -f 2)
    lt=$(sudo grep -n -r "Dense(1024,activation='relu')" * | cut -d ":" -f 3)
    fi=$(sudo grep -r "Dense(1024,activation='relu')" * | cut -d ":" -f 1)
    let "ln+=1"
    hp=" top_model = Dense(512,activation='relu')(top_model)"
    # insert the new 512-neuron layer, then re-insert the matched line below the original
    sudo sed -i -e $ln'i\\'"$hp" /mlops_project/download_classify/cnn/$fi
    sudo sed -i -e $ln'i'\\"$lt" /mlops_project/download_classify/cnn/$fi
    fe=$(sudo grep -r Conv2D * | cut -d ":" -f 1)
    sudo docker container exec cnnmodel python /root/$fe
elif sudo docker ps | grep annmodel
then
    ln=$(sudo grep -n -r "Dense(1024,activation='relu')" * | cut -d ":" -f 2)
    lt=$(sudo grep -n -r "Dense(1024,activation='relu')" * | cut -d ":" -f 3)
    fi=$(sudo grep -r "Dense(1024,activation='relu')" * | cut -d ":" -f 1)
    let "ln+=1"
    hp=" top_model = Dense(512,activation='relu')(top_model)"
    sudo sed -i -e $ln'i\\'"$hp" /mlops_project/download_classify/ann/$fi
    sudo sed -i -e $ln'i'\\"$lt" /mlops_project/download_classify/ann/$fi
    fe=$(sudo grep -r Dense * | cut -d ":" -f 1)
    sudo docker container exec annmodel python /root/$fe
else
    echo "Something is wrong with the code"
fi
This is our final modified code. Threshold is still 92%.
Here you can see we have added one Dense layer with 1024 neurons and another with 512 neurons. Now let's see what our setup produces.
Finally, we achieved 94% accuracy!
Remember, we didn't really do anything manually here; I did a few steps by hand only to show you all how the retrain works. Otherwise, we were lucky to get such great accuracy at the very beginning! This is the core reason to choose MLOps: you just have to commit the code, and all the remaining changes our MLOps setup makes for us.