Docker and Kubernetes for Data Science
Srivatsan Srinivasan
Chief Data Scientist | Gen AI | AI Advocate | YouTuber (bit.ly/AIEngineering)
TensorFlow, PyTorch, pandas, NumPy, protobuf, Dask, scikit-learn, Keras, XGBoost, LightGBM, SciPy, and the list goes on. Equivalent sets of packages are available in R and other languages as well
Every package has multiple actively maintained versions, and on top of that there are nightly builds for working with the latest and greatest: TensorFlow Nightly, PyTorch Nightly, and so on
Packages and versions apart, there are dependencies between every package and version. PyTorch might require version X of NumPy, while TensorFlow might require version Y
It does not stop there: packages also have device-accelerator dependencies. TensorFlow with CUDA, Intel-optimized TensorFlow, and maybe tomorrow OpenCL, AMD GPUs, and more
This dependency sprawl is fine if one is running experiments on a local laptop at limited scale. Now think of an enterprise where hundreds of machine learning and AI projects run on tooling and infrastructure managed by a centralized IT team. Every project brings its own dependencies, packages, and specific versions. The IT team ends up constantly maintaining multiple custom environments while at the same time ensuring old code does not break
Even if each data science team is assumed to maintain its own custom environment, it becomes difficult and time-consuming for software engineers to replicate those custom environments in production
If you think that's all, it is not: operating systems can differ between desktops, training servers, deployment servers, and perhaps the cloud
Is that all? No. But I will stop here and focus on a solution that overcomes these custom-environment challenges and brings more collaboration between data scientists, the IT team, software engineers, and other key stakeholders
You got it. It's containers. While there are many container environments, I am going to talk mostly about Docker, and for orchestrating containers we are going to focus on Kubernetes
What is Docker?
Docker allows applications to be packaged in self-contained environments aiding in quicker deployments and bringing in closer parity with training or development environments
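As a sketch of what this packaging looks like, here is a minimal, hypothetical Dockerfile for a training environment. The file names (requirements.txt, train.py) and versions are illustrative; the point is that every dependency is pinned inside the image, so the same environment runs on a laptop, a training server, or the cloud.

```dockerfile
# Hypothetical training image: all dependencies pinned inside the container
FROM python:3.10-slim

WORKDIR /app

# requirements.txt pins exact versions, e.g. tensorflow==2.12.0, numpy==1.23.5
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the training code and run it when the container starts
COPY train.py .
CMD ["python", "train.py"]
```

Built once with `docker build`, the resulting image can be pushed to a registry and pulled anywhere, which is exactly the hand-off to software engineers described later in this article.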
What is Kubernetes?
Kubernetes automates container provisioning, networking, load-balancing, security and scaling
Kubernetes makes developing and deploying machine learning models simple, consistent, and scalable
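To make that concrete, a minimal (hypothetical) Kubernetes Deployment for a model-serving container might look like this; the image name and port are placeholders. Kubernetes then handles placing the three replicas on nodes, restarting failed pods, and load-balancing traffic across them.

```yaml
# Hypothetical Deployment: run three replicas of a model-serving image
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
    spec:
      containers:
        - name: model-serving
          image: registry.example.com/model-serving:1.0
          ports:
            - containerPort: 5000
```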
There are many benefits to using containers and Kubernetes, and they come in handy across the entire data science lifecycle, from training to deployment
My Video on Docker and Kubernetes for Data Science on YouTube
You can also subscribe to my YouTube channel AIEngineering to get alerts as I post new videos
Now coming to the topic: containers play an important role in

- Infrastructure
- Enabling Multi-Tenancy
- Hybrid Cloud
- Tooling and Reproducibility
- Deployment
Infrastructure
Infrastructure is one of the key investments in a data science initiative. Enterprises have to invest in high-performance systems/GPUs to accelerate data science work
If you look at model training, infrastructure is typically used heavily for a brief period, after which the cluster sits almost idle. Kubernetes allows efficient sharing of resources and enables multi-tenancy, so multiple machine learning projects within an enterprise can share and utilize infrastructure resources more efficiently
One additional benefit Kubernetes brings is support for cloud-native as well as cloud-ready architectures. It is easy to build a hybrid cloud strategy that uses on-premise resources and bursts out to the cloud as needed. This way on-premise investment can be kept minimal and the cloud used as extended infrastructure, keeping costs under control
Tooling and Reproducibility
With multi-tenancy, every project might have its own tools and specific tool versions. Containers help you create those virtualized environments with all dependencies bundled in
Also, with all dependencies bundled, once a model is developed it is easy to hand the container to software engineers to deploy, rather than sending lengthy installation instructions and a dependency matrix
Deployment
Once the model is trained, just add a quick serving function to the container that loads the model and also bundles the pre-processing pipeline dependencies. The serving function here can be a Flask app, a Java Spring application, TF Serving, or something else, depending on the algorithm or tool used to develop the model
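A minimal Python sketch of that idea, with hypothetical names throughout: a stand-in "pipeline" object bundles pre-processing with prediction, and the serve function does what a Flask or TF Serving route handler would do, namely load the saved model and score JSON payloads.

```python
import json
import pickle


class ScaleThenThreshold:
    """Stand-in for a real pre-processing + model pipeline (illustrative only)."""

    def __init__(self, scale, threshold):
        self.scale = scale
        self.threshold = threshold

    def preprocess(self, features):
        # Pre-processing step bundled with the model, not left to the caller
        return [x * self.scale for x in features]

    def predict(self, features):
        # Score the pre-processed features
        return [1 if x > self.threshold else 0 for x in self.preprocess(features)]


def save_model(model, path):
    """Persist the whole pipeline so the serving container can load it."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def serve(path, request_body):
    """What a serving endpoint does: load the model, score a JSON request."""
    with open(path, "rb") as f:
        model = pickle.load(f)
    features = json.loads(request_body)["features"]
    return json.dumps({"predictions": model.predict(features)})
```

In a real container you would load the model once at startup and wrap serve in an HTTP framework, but the key point stands: model and pre-processing travel together inside the image.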
The real benefit of Kubernetes during deployment comes from scaling resources on demand to meet business needs. During peak volume, scale out the serving pods; at normal load, run pods at usual capacity
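This on-demand scaling can be expressed declaratively. Below is a hypothetical HorizontalPodAutoscaler (the deployment name and thresholds are illustrative) that keeps between 2 and 10 serving pods, adding replicas when average CPU utilization crosses 80%.

```yaml
# Hypothetical autoscaler for the model-serving Deployment
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-serving-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-serving
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```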
You can also define deployment strategies such as A/B, blue-green, and canary. These can enable zero-downtime deployments as well as champion-challenger deployment strategies