A project on complete automation of DL model auto-tuning:-
Saranya Chattopadhyay
Full Stack Developer - DevSecOps @IBM | DevOps Practitioner | Ex Intern @CommVault, HighRadius | 2x GCP, 1x Microsoft, 1x RedHat Certified Engineer
Hey all! Ever thought how cool it would be if our ML / DL model could auto-tune itself to achieve even greater accuracy, or in simple words, tweak itself? Well, that's what my friend Saptarsi and I have deployed in this end-to-end automation project. We request you all to give this article a quick read; we have tried to cover all the concepts used in the project.
Problem Statement:-
1. Create a container image that has Python 3 and Keras or NumPy installed, using a Dockerfile.
2. When we launch this image, it should automatically start to train the model in the container.
3. Create a job chain of job1, job2, job3, job4 and job5 in Jenkins :
i) Job1 : Pull the GitHub repo automatically when a developer pushes to GitHub.
ii) Job2 : By looking at the code or program file, Jenkins should automatically start the container image that has the matching machine learning software and interpreter installed, deploy the code and start training (e.g. if the code uses a CNN, Jenkins should start the container that already has all the software required for CNN processing installed).
iii) Job3 : Predict accuracy or metrics from the trained model.
iv) Job4 : If the accuracy metric is less than 80%, tweak the machine learning model architecture.
v) Job5 : Retrain the model or notify that the best model has been created.
4. Create one extra job, Job6, for monitoring : If the container where the app is running fails for any reason, this job should automatically start the container again from where the last trained model left off.
Solution performed:-
CONFIGURING THE IMAGES AND JOBS
Let's start with building the Docker images in Redhat8. For this, we are going to use Dockerfiles. We will build two images - one which will have the environment for ANN and another which will have the environment for typical ML.
Dockerfile for ANN is -
Dockerfile for ML is -
We will build the images from the Dockerfiles using the docker build command, and when our images are ready, we can view them using the docker images command. (Note : In the screenshot, our created images are named ml:v1 and cnn:v1. P.S-Please bear with the fact that the image has been named cnn:v1 although it has the environment of ANN :-p)
Switching to Windows10, let's write the Python code required for the use-case. We need one DL script and another script for sending a mail once the required accuracy is reached. For the DL code, we chose the very famous MNIST Handwritten Digits dataset. Once the scripts are written (the mail script in a mail_code.py file, and the training script that Job2 executes in model_code.py) and the initial accuracy has been noted in Accuracy.txt, they are uploaded to GitHub (you can go through them via the link provided in this post). Along with this, we also create a post-commit hook file so that any commit made on the local repo is automatically pushed to GitHub - yet another automation.
Initially, the github repository would look somewhat like this-
Now, our codes are on the SCM system and thus, we are all set to start with the Jenkins jobs.
- Configuration of Job1: pull_code: This job pulls the code from GitHub whenever there is a push (for this it polls GitHub every minute) and copies the contents into the /root/mlops_tweaktask folder of Redhat8. The configuration will be as follows:-
- Configuration of Job2: launch_image: This job launches the respective image, i.e. ml:v1 if the code is typical ML or cnn:v1 if the code is ANN, and accordingly executes the model_code.py file to train the model and note the resulting accuracy in the Accuracy.txt file. The configuration will be as follows:-
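A minimal sketch of how such an image selection could work, in Python. This is an illustration under assumptions, not the exact job configuration: the helper names `pick_image` and `docker_run_command`, the container name `trainer`, the rule "imports Keras or TensorFlow means ANN" and the volume mount are all hypothetical. Only the image names, the workspace folder and model_code.py come from the article.

```python
import re

# Assumption: code that imports keras or tensorflow counts as ANN/DL code.
DL_IMPORT = re.compile(r"^\s*(?:import|from)\s+(?:keras|tensorflow)\b", re.MULTILINE)


def pick_image(source_code: str) -> str:
    """Return cnn:v1 for code importing Keras/TensorFlow, otherwise ml:v1."""
    return "cnn:v1" if DL_IMPORT.search(source_code) else "ml:v1"


def docker_run_command(source_code: str,
                       workspace: str = "/root/mlops_tweaktask",
                       container_name: str = "trainer") -> list:
    """Build (but do not execute) a docker run command for the chosen image."""
    image = pick_image(source_code)
    return [
        "docker", "run", "--rm", "--name", container_name,
        "-v", f"{workspace}:{workspace}",  # assumed mount: share code and Accuracy.txt
        "-w", workspace,
        image, "python3", "model_code.py",
    ]
```

The returned list can be passed to `subprocess.run` from a Jenkins "Execute shell" step that calls the script; building the command separately keeps the selection logic easy to check without Docker installed.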
- Configuration of Job3: check_accuracy: This job will read the Accuracy.txt file and get the accuracy of the model. The benchmark accuracy set is 96%. If the accuracy obtained is less than 96%, the build of the job will fail and trigger tweaking of the model. If the required accuracy is achieved, the job will succeed and add and commit the better accuracy to github followed by triggering the sending of success notification to the developer. Since this job will be accessing github, we need to give the github credentials in the SCM tab. The configuration will be as follows:-
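The pass/fail decision of this job can be sketched in a few lines of Python. The format of Accuracy.txt is an assumption here: a single number, either a fraction such as 0.94 or a percentage such as 97 or 97%. The function names are hypothetical; the 96% benchmark is the one stated above.

```python
from pathlib import Path

BENCHMARK = 96.0  # percent, the benchmark used in this project


def parse_accuracy(text: str) -> float:
    """Parse the file content into a percentage (assumed single-number format)."""
    value = float(text.strip().rstrip("%"))
    # Values up to 1.0 are treated as fractions, e.g. 0.94 -> 94.0
    return value * 100 if value <= 1.0 else value


def read_accuracy(path: str = "Accuracy.txt") -> float:
    """Read and parse the accuracy file written by the training job."""
    return parse_accuracy(Path(path).read_text())


def exit_code(accuracy: float, benchmark: float = BENCHMARK) -> int:
    """0 marks the Jenkins build as successful, non-zero marks it as failed."""
    return 0 if accuracy >= benchmark else 1
```

In a Jenkins shell step, ending the script with `sys.exit(exit_code(read_accuracy()))` would fail the build below the benchmark, which is what lets the failure trigger the tweak job.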
- Configuration of Job4: tweak_job: Frankly, this job needed the most brainstorming and a lot of research into how to tweak the model. It builds if the 3rd job (checking accuracy) fails. Finally, we could come up with the lines of code that fulfill our requirements. After tweaking, the tweaked model is added to GitHub by the job itself, which triggers a build of Job2 again for retraining. Here, we again need to provide the GitHub credentials in the SCM tab. Other configurations are as follows:-
After the commits of better accuracy and the tweaked model, the GitHub repository will look somewhat like this-
- Configuration of Job5: send_mail: Once Job3 is successful, i.e. when the required accuracy is achieved, this job sends a notification email to the developer regarding the success. The configuration will be as follows:-
- Configuration of Job6: monitoring_job: This job does what an orchestrator like Kubernetes would do. If there is a problem with the running image, i.e. if the environment fails, this job relaunches the respective image and resumes the chain. The configuration will be as follows:-
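One simple way to tweak a model from a job is to rewrite a hyperparameter directly in the source file before committing it. The sketch below is only an illustration of that idea, not the tweak code used in this project. It assumes the model code contains plain integer assignments such as `epochs = 5` or `units = 64`; those names and the function `bump_int_assignment` are hypothetical.

```python
import re


def bump_int_assignment(source: str, name: str, factor: int = 2) -> str:
    """Multiply the first `name = <int>` assignment in `source` by `factor`."""
    pattern = re.compile(
        rf"^([ \t]*{re.escape(name)}[ \t]*=[ \t]*)(\d+)(?![\d.])",
        re.MULTILINE,
    )

    def repl(match):
        # Keep the left-hand side untouched, scale only the integer value.
        return f"{match.group(1)}{int(match.group(2)) * factor}"

    new_source, count = pattern.subn(repl, source, count=1)
    if count == 0:
        raise ValueError(f"no integer assignment found for {name!r}")
    return new_source
```

For example, `bump_int_assignment(code, "units")` would turn `units = 64` into `units = 128` and leave the rest of the file unchanged. The job would then write the result back to model_code.py and commit it so that retraining starts.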
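A notification mail of this kind can be sent with Python's standard library. This is a minimal sketch, not the contents of our mail script: the function names, subject line and message text are hypothetical, and the SMTP host, port and credentials are placeholders to fill in for your own mail provider.

```python
import smtplib
from email.message import EmailMessage


def build_success_mail(sender: str, recipient: str, accuracy: float) -> EmailMessage:
    """Compose the success notification (wording is illustrative)."""
    msg = EmailMessage()
    msg["Subject"] = "Model training succeeded"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        f"The model reached {accuracy:.2f}% accuracy and met the benchmark."
    )
    return msg


def send_mail(msg: EmailMessage, host: str, port: int = 587,
              username: str = None, password: str = None) -> None:
    """Send the message over SMTP with STARTTLS (assumed server setup)."""
    with smtplib.SMTP(host, port) as server:
        server.starttls()
        if username:
            server.login(username, password)
        server.send_message(msg)
```

Keeping message construction separate from sending makes it possible to check the mail content without contacting a mail server.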
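The monitoring idea can be sketched as "list running containers, relaunch if ours is missing". The code below is an assumption-laden illustration rather than the actual job: the function names and the container name used in the example are hypothetical. Resuming from the last trained model would additionally require the model file to live on a mounted volume, which this sketch does not handle.

```python
import subprocess


def running_containers(run=subprocess.run) -> set:
    """Return the names of running containers as reported by `docker ps`."""
    result = run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return set(result.stdout.split())


def ensure_running(name: str, relaunch_cmd: list, run=subprocess.run) -> bool:
    """Relaunch the container if it is not running; return True if relaunched."""
    if name in running_containers(run):
        return False
    run(relaunch_cmd, check=True)
    return True
```

A periodic Jenkins build could call, for instance, `ensure_running("trainer", ["docker", "run", "-d", "--name", "trainer", "cnn:v1"])`. The `run` parameter exists only so the logic can be exercised with a stand-in instead of a real Docker daemon.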
WORKING OF THE PROJECT
The initial architecture used in the model yielded us an accuracy of 94% and our benchmark set was 96%. Thus the job chain was initiated. Initial architecture used is as follows:-
After successful running of the chain, our model was tweaked and yielded an accuracy of 97%, thus achieving the benchmark. The tweaked architecture is as follows:-
After all the jobs run successfully, the Jenkins Dashboard would look like this:-
Mail sent on successful model training and accuracy achievement.
Thus, our end-to-end automation of DL code auto-tuning was successful. This idea can be of great use in industry, as poor model accuracy leads to unreliable predictions in many use-cases. Automated tuning can be much faster than manual work, since tweaking a model by hand and training it again and again is really time-consuming. If this automation is deployed on platforms like AWS Cloud, even the RAM and CPU limits of a local machine become less of a problem.
We thank our mentor Mr. Vimal Daga Sir from the bottom of our hearts for giving us the opportunity to implement this wonderful automation task. We learned a lot and cleared up many concepts through this task.
A PROJECT BY SAPTARSI ROY AND SARANYA CHATTOPADHYAY