Tips for building a scalable IT Infrastructure for running AI/ML
‘Systems are more important than algorithms’ could be the most profound learning for me in the past year of experimenting with AI/ML systems. Like most of us taking up AI, I was engrossed in the latest papers and models until I hit the proverbial iceberg - improperly designed IT infrastructure.
My first brush with production-grade AI came when we tried to take to production a model built on a bi-directional LSTM for hierarchical paragraph text summarisation. A group of very bright engineers we worked with had developed the model using Torch and trained it primarily on a desktop with an NVIDIA GPU. The model was quite successful, but the sponsor wanted to take it to the next level, i.e. uninterrupted production deployment in the cloud. Our choice was Microsoft Azure.
Here is where the iceberg showed up. The moment we put the code on a high-powered GPU-based VM costing nearly $20 an hour, it broke - over CUDA driver dependencies, among other things. We spent a few painful days and a lot of GPU $$ troubleshooting the issue, but the learnings were immense, and I thought I would share them in this blog.
- Tip no. 1: Dockerizing your models is a damn good idea: Of late, when we look at any developer issue, the first question that comes to our minds is ‘Why can’t Docker solve this?’. But since we were quite unaware of how to deploy GPU-based code on Docker, we skipped this question - until we hit upon this blog:
https://devblogs.nvidia.com/nvidia-docker-gpu-server-application-deployment-made-easy/
The essence of it is captured in the architecture diagram in that post.
Containerising the application takes away all the headache of portability, and instead of thinking about it later, you should treat it as the very first step. Now we are confident of porting this application to any NVIDIA GPU-based cloud that may be cheaper, faster or more secure, at any point in time, without spending sleepless nights.
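To make this concrete, here is a minimal, illustrative Dockerfile for a GPU-based Torch model like ours. The base image tag, package list and entry-point script name are my own assumptions for the sketch - check NVIDIA's image catalogue for the CUDA/cuDNN combination your model actually needs.

```dockerfile
# Illustrative sketch only - match the CUDA/cuDNN tag to your training setup
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

# Install Python and the deep learning framework the model depends on
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*
RUN pip3 install --no-cache-dir torch

# Copy the model code and trained weights into the image
WORKDIR /app
COPY . /app

# Entry point for the summarisation service (script name is hypothetical)
CMD ["python3", "serve_summariser.py"]
```

With NVIDIA's container runtime installed on the VM, the same image can then be run with GPU access using something like `docker run --gpus all <image>` (or `nvidia-docker run` on older setups), which is the portability the linked post is about.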
- Tip no. 2: DevOps practices are invaluable: Maintaining an AI/ML model post-production is an absolute nightmare. Apart from code build and configuration issues, testing takes on a new dimension compared to traditional software. Here, testing is not just about the functionality of the code; performance metrics like prediction accuracy, recall and loss need to be measured and fed back to the developers in time.
That said, monitoring the systems in production and logging are the two most important DevOps practices that should be in place as a basic requirement before moving the code to production. They become even more critical if you are planning to run the model on distributed computing systems like Spark/Hadoop and/or BigDL.
Some good monitoring systems used by ML/AI engineers are Ganglia and Datadog, and any open-source logging and analytics stack like the Elastic stack will do. My recommendations notwithstanding, a good systems engineer will know what's best for your application, and you should consult them.
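As an illustration of the feedback loop above, here is a small Python sketch (not tied to any particular tool) that writes prediction metrics as structured JSON log lines, which a shipper such as Filebeat could forward into the Elastic stack for dashboards and alerts. The field names and file path are assumptions.

```python
import json
import logging
import time

# Minimal sketch: emit one JSON record per evaluation run so the log
# pipeline (e.g. Filebeat -> Elasticsearch -> Kibana) can index it.
logger = logging.getLogger("model_metrics")
logger.setLevel(logging.INFO)
handler = logging.FileHandler("model_metrics.log")
handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(handler)

def log_metrics(model_version: str, accuracy: float, recall: float, loss: float) -> None:
    """Write one structured metric record; field names are illustrative."""
    record = {
        "timestamp": time.time(),
        "model_version": model_version,
        "accuracy": accuracy,
        "recall": recall,
        "loss": loss,
    }
    logger.info(json.dumps(record))

# Example: metrics computed on a held-out validation batch
log_metrics("summariser-v1", accuracy=0.91, recall=0.87, loss=0.34)
```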
- Tip no. 3: Limit the need for GPUs: When training on many millions of records offline, GPUs can cut the training time drastically. But in many applications, inference can be done on CPUs - if not one, then a cluster of them - using something like BigDL (developed and open-sourced by Intel, and built on Intel's Math Kernel Library for Xeon processors), Distributed TensorFlow, or, if you are more adventurous, something like Uber's Horovod to orchestrate across multiple CPU/GPU combinations. In essence, if you really want to democratise AI/ML in your organisation, you must bring the cost of resources down, and limiting GPUs is one way to accomplish that.
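As a simple illustration of keeping inference off the GPU, here is a small PyTorch-style sketch that pins serving to CPU by default and only uses a GPU when explicitly allowed; the function and flag names are my own, and `model`/`batch` are placeholders.

```python
import torch

def get_inference_device(force_cpu: bool = True) -> torch.device:
    """Pin inference to CPU by default; pass force_cpu=False to use a GPU if one exists."""
    if not force_cpu and torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

device = get_inference_device()
print(f"Running inference on: {device}")

# 'model' and 'batch' stand in for your trained network and an input tensor:
# model = model.to(device).eval()
# with torch.no_grad():
#     summary = model(batch.to(device))
```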
- Tip no. 4: Take a lot of care in building the data pipeline: This is the most important part of the system, and the systems engineers should spend lots and lots of time selecting the proper stack for the entire lifecycle of data, from ingestion to preprocessing to processing to inference. I will write a separate blog on this topic, as it needs a very special and in-depth treatment. At a glance, I am very impressed with the PANCAKE STACK (Presto, Arrow, NiFi, Cassandra, Airflow, Kafka, Elasticsearch, Spark, TensorFlow, Algebird, CoreNLP, Kibana), and I will detail my experience of using this stack in that blog.
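Just to give a flavour of the ingestion end of such a pipeline, here is a hypothetical sketch using the kafka-python client: raw documents are published onto a Kafka topic, from where NiFi or Spark consumers can pick them up for preprocessing. The broker address, topic name and record fields are all made up for the example.

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Sketch of the ingestion edge of the pipeline: publish raw documents to Kafka
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",        # broker address is illustrative
    value_serializer=lambda doc: json.dumps(doc).encode("utf-8"),
)

document = {"doc_id": "doc-001", "text": "Paragraph text to be summarised..."}
producer.send("raw-documents", value=document)  # topic name is illustrative
producer.flush()
```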
I would NOT say this is ALL you need for building a successful IT infrastructure for AI/ML; it is just the limited knowledge I have developed through experience and wanted to share with my network. I look forward to more knowledgeable engineers and data scientists adding to my thoughts as they deem fit.