Docker and Kubernetes for Data Science
Srivatsan Srinivasan
Chief Data Scientist | Gen AI | AI Advocate | YouTuber (bit.ly/AIEngineering)
TensorFlow, PyTorch, pandas, NumPy, protobuf, Dask, scikit-learn, Keras, XGBoost, LightGBM, SciPy, and the list goes on. Equivalent sets of packages are available in R and other languages as well
Every package has multiple actively maintained versions, and on top of that there are nightly builds for working with the latest and greatest: TensorFlow Nightly, PyTorch Nightly, and so on
Packages and versions apart, there are dependencies between every package and version. PyTorch might require version X of NumPy, while TensorFlow might require version Y
It does not stop there: packages also have device-accelerator dependencies. TensorFlow with CUDA, Intel-optimized TensorFlow, and maybe tomorrow OpenCL, AMD GPUs, and more
This dependency sprawl is fine if one is running experiments on a local laptop at limited scale. Now think of an enterprise where hundreds of machine learning and AI projects run on tooling and infrastructure managed by a centralized IT team. Every project brings its own dependencies, packages, and specific versions. The IT team ends up constantly maintaining multiple custom environments while at the same time ensuring old code does not break
Even if each data science team is assumed to maintain its own custom environment, it becomes difficult and time-consuming for software engineers to replicate those custom environments in production
If you think that's all, it is not: operating systems can differ between desktops, training servers, deployment servers, and perhaps the cloud
Is that all? No. But I will stop here and focus on a solution that overcomes these custom-environment challenges and brings more collaboration between data scientists, the IT team, software engineers, and other key stakeholders
You got it. It's containers. While there are many container environments, I am going to talk mostly about Docker, and for orchestrating containers we are going to focus on Kubernetes
What is Docker?
Docker allows applications to be packaged in self-contained environments aiding in quicker deployments and bringing in closer parity with training or development environments
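As a sketch of what this packaging looks like, here is a minimal, hypothetical Dockerfile for a training environment. The file names (requirements.txt, train.py) and versions are illustrative; the point is that every dependency is pinned inside the image, so the same environment runs on a laptop, a training server, or the cloud.

```dockerfile
# Hypothetical training image: all dependencies pinned inside the container
FROM python:3.10-slim

WORKDIR /app

# requirements.txt pins exact versions, e.g. tensorflow==2.12.0, numpy==1.23.5
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the training code and run it when the container starts
COPY train.py .
CMD ["python", "train.py"]
```

Built once with `docker build`, the resulting image can be pushed to a registry and pulled anywhere, which is exactly the hand-off to software engineers described later in this article.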
What is Kubernetes?
Kubernetes automates container provisioning, networking, load-balancing, security and scaling
Kubernetes makes developing and deploying machine learning models simple, consistent, and scalable
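To make that concrete, a minimal (hypothetical) Kubernetes Deployment for a model-serving container might look like this; the image name and port are placeholders. Kubernetes then handles placing the three replicas on nodes, restarting failed pods, and load-balancing traffic across them.

```yaml
# Hypothetical Deployment: run three replicas of a model-serving image
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
    spec:
      containers:
        - name: model-serving
          image: registry.example.com/model-serving:1.0
          ports:
            - containerPort: 5000
```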
There are many benefits to using containers and Kubernetes, and they come in handy across the entire data science lifecycle, from training to deployment
My Video on Docker and Kubernetes for Data Science on YouTube
You can also subscribe to my YouTube channel AIEngineering to get alerts as I post new videos
Now coming to the topic: containers play an important role in

- Infrastructure
- Enabling Multi-Tenancy
- Hybrid Cloud
- Tooling and Reproducibility
- Deployment
Infrastructure
Infrastructure is one of the key investments in a data science initiative. Enterprises have to invest in high-performance systems/GPUs to accelerate data science work
If you look at model training, infrastructure is typically used heavily for a brief period, after which the cluster sits almost idle. Kubernetes allows efficient sharing of resources and enables multi-tenancy, so multiple machine learning projects within an enterprise can share and utilize infrastructure resources more efficiently
One additional benefit Kubernetes brings is support for cloud-native as well as cloud-ready architectures. It is easy to build a hybrid cloud strategy that uses on-premise resources and bursts out to the cloud as needed. This way on-premise investment can be kept minimal and the cloud used as extended infrastructure, keeping costs under control
Tooling and Reproducibility
With multi-tenancy, every project might have its own tools and specific tool versions. Containers help you create those virtualized environments with all dependencies bundled in
Also, with all dependencies bundled, once a model is developed it is easy to hand the container to software engineers to deploy, rather than sending lengthy installation instructions and a dependency matrix
Deployment
Once the model is trained, just add a quick serving function to the container that loads the model and also bundles the pre-processing pipeline dependencies. The serving function here can be a Flask app, a Java Spring application, TF Serving, or something else, depending on the algorithm or tool used to develop the model
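A minimal Python sketch of that idea, with hypothetical names throughout: a stand-in "pipeline" object bundles pre-processing with prediction, and the serve function does what a Flask or TF Serving route handler would do, namely load the saved model and score JSON payloads.

```python
import json
import pickle


class ScaleThenThreshold:
    """Stand-in for a real pre-processing + model pipeline (illustrative only)."""

    def __init__(self, scale, threshold):
        self.scale = scale
        self.threshold = threshold

    def preprocess(self, features):
        # Pre-processing step bundled with the model, not left to the caller
        return [x * self.scale for x in features]

    def predict(self, features):
        # Score the pre-processed features
        return [1 if x > self.threshold else 0 for x in self.preprocess(features)]


def save_model(model, path):
    """Persist the whole pipeline so the serving container can load it."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def serve(path, request_body):
    """What a serving endpoint does: load the model, score a JSON request."""
    with open(path, "rb") as f:
        model = pickle.load(f)
    features = json.loads(request_body)["features"]
    return json.dumps({"predictions": model.predict(features)})
```

In a real container you would load the model once at startup and wrap serve in an HTTP framework, but the key point stands: model and pre-processing travel together inside the image.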
The real benefit of Kubernetes during deployment comes from scaling resources on demand to meet business needs. During peak volume, scale out the serving pods; at normal load, run pods at usual capacity
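This on-demand scaling can be expressed declaratively. Below is a hypothetical HorizontalPodAutoscaler (the deployment name and thresholds are illustrative) that keeps between 2 and 10 serving pods, adding replicas when average CPU utilization crosses 80%.

```yaml
# Hypothetical autoscaler for the model-serving Deployment
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-serving-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-serving
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```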
You can also define deployment strategies such as A/B, blue-green, and canary. These can enable zero-downtime deployments as well as champion-challenger deployment strategies