Docker and Kubernetes for Data Science

TensorFlow, PyTorch, pandas, NumPy, protobuf, Dask, scikit-learn, Keras, XGBoost, LightGBM, SciPy, and the list goes on. Equivalent sets of packages are available in R and other languages as well.

Every package has multiple actively maintained versions, and on top of that there are nightly builds for anyone who wants the latest and greatest: TensorFlow Nightly, PyTorch Nightly, and so on.

Packages and versions apart, there are dependencies between them: PyTorch might require version X of NumPy while TensorFlow requires version Y.

It does not stop there; packages also carry device and accelerator dependencies: TensorFlow with CUDA, Intel-optimized TensorFlow, and maybe tomorrow OpenCL, AMD GPUs, and more.

These dependencies are manageable if one is running experiments on a local laptop at limited scale. Now think of an enterprise where hundreds of machine learning and AI projects run on tooling and infrastructure managed by a centralized IT team. Every project brings in its own packages, versions, and dependencies, and the IT team constantly ends up maintaining multiple custom environments while ensuring old code does not break.

Even if each data science team is assumed to maintain its own custom environment, it becomes difficult and time consuming for software engineers to replicate those environments in production.

If you think that's all, it is not. There are also differences in operating systems between desktops, training servers, deployment servers, and possibly the cloud.

Is that all? No, but I will stop here and focus on a solution that overcomes these custom environment challenges and brings more collaboration between data scientists, the IT team, software engineers, and other key stakeholders.

You got it. It's containers... While there are many container environments, I am going to talk mostly about Docker, and for orchestrating containers we are going to focus on Kubernetes.

What is Docker?

Docker allows applications to be packaged into self-contained environments, aiding quicker deployments and bringing production closer to parity with training and development environments.
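
To make this concrete, here is a minimal Dockerfile sketch for a data science image; the base image tag, requirements.txt, and train.py names are illustrative assumptions rather than anything specific to this article:

```dockerfile
# Minimal sketch of a training image. Base image tag and file names
# are illustrative assumptions, not prescriptions.
FROM python:3.10-slim

WORKDIR /app

# Pin the exact package versions the project was developed against
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the project code into the image
COPY . .

CMD ["python", "train.py"]
```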

What is Kubernetes?

Kubernetes automates container provisioning, networking, load balancing, security, and scaling.

Kubernetes makes the development and deployment of machine learning models simple, consistent, and scalable.
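
As a rough sketch of what this looks like, here is a minimal Kubernetes Deployment for a model-serving container; the names, image reference, and replica count are illustrative assumptions:

```yaml
# Sketch: run 3 replicas of a model-serving container.
# Names and the image reference are illustrative, not from this article.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
    spec:
      containers:
        - name: model-serving
          image: registry.example.com/model-serving:1.0
          ports:
            - containerPort: 5000
```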

There are many benefits to using containers and Kubernetes, and they come in handy throughout the entire data science lifecycle, from training to deployment.

My Video on Docker and Kubernetes for Data Science on YouTube

You can also subscribe to my YouTube channel, AIEngineering, to get alerts as I post new videos.

Now coming to the topic: containers play an important role in

- Infrastructure
- Enabling Multi-Tenancy
- Hybrid Cloud
- Tooling and Reproducibility
- Deployment

Infrastructure

Infrastructure is one of the key investments in a data science initiative. Enterprises have to invest in high-performance systems and GPUs to accelerate their data science work.

If you look at model training, there is typically heavy usage of infrastructure for a brief period, after which the cluster sits almost idle. Kubernetes allows efficient sharing of resources and enables multi-tenancy, so multiple machine learning projects within an enterprise can share and utilize infrastructure resources more efficiently.
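
One common way to enable this multi-tenancy is to give each project its own namespace with a resource quota; the sketch below is illustrative, and the namespace name and limits are assumptions:

```yaml
# Sketch: cap what a single project's namespace can consume so that
# several teams can share one cluster. All values are illustrative.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-project-quota
  namespace: fraud-detection
spec:
  hard:
    requests.cpu: "16"
    requests.memory: 64Gi
    requests.nvidia.com/gpu: "2"
```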

One additional benefit Kubernetes brings is support for cloud-native as well as cloud-ready architectures. It is easy to build a hybrid cloud strategy that uses on-premise resources and bursts out to the cloud as needed. This way the on-premise investment can be kept minimal and the cloud can be used as extended infrastructure, keeping cost under control.

Tooling and Reproducibility

With multi-tenancy, every project might have its own tools and specific versions of those tools. Containers help you create those isolated environments with all dependencies bundled in.

Also, with all dependencies bundled, once a model is developed it is easy to hand the container to software engineers to deploy, rather than sending lengthy installation instructions and a dependency matrix.
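
Sharing that environment can be as simple as building the image and pushing it to a registry both teams can pull from; the registry address and tag below are placeholders:

```bash
# Build the image from the project's Dockerfile and publish it to a shared registry
docker build -t registry.example.com/churn-model:1.2 .
docker push registry.example.com/churn-model:1.2
```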

Deployment

Once the model is trained, just add a quick serving function to the container that loads the model and bundles the pre-processing pipeline dependencies. The serving function can be a Flask app, a Java Spring application, TF Serving, or something else, depending on the algorithm or tool used to develop the model.
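
Here is a hedged sketch of such a serving function as a Flask app; the model file, request payload, and port are assumptions for illustration:

```python
# Minimal Flask serving sketch. The model file, expected JSON payload,
# and port are illustrative assumptions, not from this article.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the trained model (and any bundled pre-processing pipeline) once at startup
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()  # e.g. {"features": [[5.1, 3.5, 1.4, 0.2]]}
    prediction = model.predict(payload["features"])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```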

The real benefit of Kubernetes during deployment comes from scaling resources on demand to meet business needs. During peak volume, scale out the serving pods; during normal load, run them at usual capacity.
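
Kubernetes can handle this scaling automatically; below is a sketch of a HorizontalPodAutoscaler targeting the serving Deployment sketched earlier, with illustrative thresholds and replica counts:

```yaml
# Sketch: scale the serving pods between 2 and 10 replicas based on
# average CPU utilization. Numbers are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-serving-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-serving
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```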

You can also define deployment strategies like A/B, blue-green, and canary. These help perform zero-downtime deployments as well as enable champion-challenger strategies.
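
As one illustration, a basic canary can be approximated in plain Kubernetes by running the challenger model as a second Deployment behind the same Service selector, so it receives a share of traffic roughly proportional to its replica count; all names, images, and counts below are assumptions:

```yaml
# Sketch of a simple canary: the Service selects on "app" only, so traffic is
# split across the stable and canary Deployments roughly in proportion to
# their replica counts. All names, images, and counts are illustrative.
apiVersion: v1
kind: Service
metadata:
  name: model-serving
spec:
  selector:
    app: model-serving        # matches both stable and canary pods
  ports:
    - port: 80
      targetPort: 5000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving-canary
spec:
  replicas: 1                 # e.g. ~10% of traffic if the stable track runs 9 replicas
  selector:
    matchLabels:
      app: model-serving
      track: canary
  template:
    metadata:
      labels:
        app: model-serving
        track: canary
    spec:
      containers:
        - name: model-serving
          image: registry.example.com/model-serving:2.0-challenger
          ports:
            - containerPort: 5000
```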
