Tips for building a scalable IT Infrastructure for running AI/ML
‘Systems are more important than algorithms’ could be the most profound learning for me in the past year of experimenting with AI/ML systems. Like most of us taking up AI, I was engrossed in the latest papers and models until I hit the proverbial iceberg - improperly designed IT infrastructure.
My first brush with production-grade AI came when we tried to take to production a model built on a bi-directional LSTM for hierarchical paragraph text summarisation. A group of very bright engineers we worked with had developed the model using Torch and trained it primarily on a desktop with an NVIDIA GPU. The model was quite successful, but the sponsor wanted to take it to the next level, i.e. uninterrupted production deployment in the cloud. Our choice was Microsoft Azure.
Here is where the iceberg showed up. The moment we put the code on a high-powered GPU-based VM costing nearly $20 an hour, it broke - over CUDA driver dependencies, among other things. We spent a few painful days and a lot of GPU $$ troubleshooting the issue, but the learnings were immense, and I thought I would share them in this blog.
- Tip no. 1: Dockerizing your models is a damn good idea: Of late, when we look at any developer issue, the first question that comes to our minds is ‘Why can’t Docker solve this?’. But since we were quite unaware of how to deploy GPU-based code on Docker, we skipped this question - until we hit upon this blog:
https://devblogs.nvidia.com/nvidia-docker-gpu-server-application-deployment-made-easy/
The essence of it is captured in the architecture diagram in that post.
Containerising the application takes away all the headache of portability, and instead of thinking about it later, you should treat it as the very first step. Now we are confident of porting this application to any NVIDIA GPU-based cloud that may be cheaper, faster or more secure, at any point in time, without spending sleepless nights.
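To make this concrete, here is a minimal, illustrative Dockerfile for a GPU-based Torch model like ours. The base image tag, package list and entry-point script name are my own assumptions for the sketch - check NVIDIA's image catalogue for the CUDA/cuDNN combination your model actually needs.

```dockerfile
# Illustrative sketch only - match the CUDA/cuDNN tag to your training setup
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

# Install Python and the deep learning framework the model depends on
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*
RUN pip3 install --no-cache-dir torch

# Copy the model code and trained weights into the image
WORKDIR /app
COPY . /app

# Entry point for the summarisation service (script name is hypothetical)
CMD ["python3", "serve_summariser.py"]
```

With NVIDIA's container runtime installed on the VM, the same image can then be run with GPU access using something like `docker run --gpus all <image>` (or `nvidia-docker run` on older setups), which is the portability the linked post is about.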
- Tip no. 2: DevOps practices are invaluable: Maintaining an AI/ML model post-production is an absolute nightmare. Apart from code build and configuration issues, testing takes on a new dimension compared to traditional software. Here, testing is not just about the functionality of the code; performance metrics like prediction accuracy, recall and loss need to be measured and fed back to the developers in time.
That said, monitoring the systems in production and logging are the two most important DevOps practices that should be in place as a basic requirement before moving the code to production. They become even more critical if you are planning to run the model on distributed computing systems like Spark/Hadoop and/or BigDL.
Some good monitoring systems used by ML/AI engineers are Ganglia and Datadog, and any open-source logging and analytics stack like the Elastic stack will do. My recommendations notwithstanding, a good systems engineer will know what's best for your application, and you should consult them.
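As an illustration of the feedback loop above, here is a small Python sketch (not tied to any particular tool) that writes prediction metrics as structured JSON log lines, which a shipper such as Filebeat could forward into the Elastic stack for dashboards and alerts. The field names and file path are assumptions.

```python
import json
import logging
import time

# Minimal sketch: emit one JSON record per evaluation run so the log
# pipeline (e.g. Filebeat -> Elasticsearch -> Kibana) can index it.
logger = logging.getLogger("model_metrics")
logger.setLevel(logging.INFO)
handler = logging.FileHandler("model_metrics.log")
handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(handler)

def log_metrics(model_version: str, accuracy: float, recall: float, loss: float) -> None:
    """Write one structured metric record; field names are illustrative."""
    record = {
        "timestamp": time.time(),
        "model_version": model_version,
        "accuracy": accuracy,
        "recall": recall,
        "loss": loss,
    }
    logger.info(json.dumps(record))

# Example: metrics computed on a held-out validation batch
log_metrics("summariser-v1", accuracy=0.91, recall=0.87, loss=0.34)
```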
- Tip no. 3: Limit the need for GPUs: When training on many millions of records offline, GPUs can cut the training time drastically. But in many applications, inference can be done on CPUs - if not one, then a cluster of them - using something like BigDL (developed and open-sourced by Intel, and built on Intel's Math Kernel Library for Xeon processors), Distributed TensorFlow, or, if you are more adventurous, something like Uber's Horovod to orchestrate across multiple CPU/GPU combinations. In essence, if you really want to democratise AI/ML in your organisation, you must bring the cost of resources down, and limiting GPUs is one way to accomplish that.
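As a simple illustration of keeping inference off the GPU, here is a small PyTorch-style sketch that pins serving to CPU by default and only uses a GPU when explicitly allowed; the function and flag names are my own, and `model`/`batch` are placeholders.

```python
import torch

def get_inference_device(force_cpu: bool = True) -> torch.device:
    """Pin inference to CPU by default; pass force_cpu=False to use a GPU if one exists."""
    if not force_cpu and torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

device = get_inference_device()
print(f"Running inference on: {device}")

# 'model' and 'batch' stand in for your trained network and an input tensor:
# model = model.to(device).eval()
# with torch.no_grad():
#     summary = model(batch.to(device))
```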
- Tip no. 4: Take a lot of care in building the data pipeline: This is the most important part of the system, and the systems engineers should spend lots and lots of time selecting the proper stack for the entire lifecycle of data, from ingestion to preprocessing to processing to inference. I will write a separate blog on this topic, as it needs a very special and in-depth treatment. At a glance, I am very impressed with the PANCAKE STACK (Presto, Arrow, NiFi, Cassandra, Airflow, Kafka, Elasticsearch, Spark, TensorFlow, Algebird, CoreNLP, Kibana), and I will detail my experience of using this stack in that blog.
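Just to give a flavour of the ingestion end of such a pipeline, here is a hypothetical sketch using the kafka-python client: raw documents are published onto a Kafka topic, from where NiFi or Spark consumers can pick them up for preprocessing. The broker address, topic name and record fields are all made up for the example.

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Sketch of the ingestion edge of the pipeline: publish raw documents to Kafka
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",        # broker address is illustrative
    value_serializer=lambda doc: json.dumps(doc).encode("utf-8"),
)

document = {"doc_id": "doc-001", "text": "Paragraph text to be summarised..."}
producer.send("raw-documents", value=document)  # topic name is illustrative
producer.flush()
```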
I would NOT say this is ALL you need for building a successful IT infrastructure for AI/ML; it is just the limited knowledge I have developed through experience and wanted to share with my network. I look forward to more knowledgeable engineers and data scientists adding to my thoughts as they deem fit.