How to build your own deep learning infrastructure

Deep learning is a type of machine learning that uses large amounts of data and algorithms to build models that can learn and make predictions. Good infrastructure is crucial for success in deep learning, but thanks to open-source tools, anyone can build their own. This has made deep learning more accessible and has contributed to its recent rapid progress.

In this article, we’ll discuss how deep learning research usually proceeds, describe the infrastructure choices we have to make to support it, and open-source kubernetes-ec2-autoscaler, a batch-optimized scaling manager for Kubernetes. I hope you find this article useful in building your own deep learning infrastructure.

The use case:

A common process in deep learning involves starting with an idea and testing it on a small problem to see if it works. This allows researchers to quickly run a variety of experiments to see which approaches are most effective. To do this efficiently, they need to be able to easily access computing resources and run their experiments without encountering barriers.

Once a deep learning model has been tested and shown to be effective on a small scale, the next step is to try to improve it by pushing it to its limits and finding ways to overcome its limitations. This involves running the model repeatedly to see how it behaves under different conditions, and then making changes and adjustments to improve its performance. This process is similar to building any other type of software system, and it requires iterative experimentation and testing to identify and fix any issues.

Deep learning infrastructure must enable users to easily inspect and analyze their models, in order to understand how they are working and identify any potential issues. Providing only summary statistics is not sufficient, as it does not give users the detailed insights they need to improve their models.

Once a deep learning model has shown promise on a small scale, the next step is to scale it up to larger datasets and more computing resources. This typically involves running longer, more complex experiments that can take multiple days to complete. Careful experiment management and thoughtful hyperparameter selection are crucial in this phase, as they can significantly impact the quality of the results.

The early research process in deep learning is often unstructured and rapid, while the later stages are more methodical and time-consuming. However, both stages are important for achieving a great result.

An example:

The paper "Improved Techniques for Training GANs" describes several techniques for improving the training of generative adversarial networks (GANs), which are a type of machine learning model that involves two competing neural networks. The generator network tries to create fake data that is similar to the real data, while the discriminator network tries to identify which data is real and which is fake. A successful generator network is one that can consistently fool the discriminator network.

However, GANs have a potential failure mode known as "collapse," where the generator network always outputs the same sample, even if it is a realistic-looking one. The paper discusses a technique for addressing this issue and improving the performance of GANs.

The "Improved Techniques for Training GANs" paper describes a technique for addressing the issue of GAN collapse, where the generator network always outputs the same sample. The technique involves giving the discriminator network an entire minibatch of samples as input, rather than just a single sample. This allows the discriminator to detect when the generator is producing the same sample over and over, and sends gradients back to the generator to correct the problem.

The technique was initially tested on the MNIST and CIFAR-10 datasets, which allowed for rapid prototyping and iteration. The results on CIFAR-10 were particularly promising, producing some of the best samples the authors had seen on that dataset.

However, to be truly useful, deep learning algorithms need to scale up to larger datasets such as ImageNet. Ian Goodfellow, one of the paper's authors, then focused on scaling the model up to work on ImageNet.

[Image: OpenAI model learning to generate ImageNet images]

To train a deep learning model on a large dataset like ImageNet, it is necessary to use multiple GPUs in parallel. This allows the model to process more data faster, but also requires careful experiment management to ensure that each experiment is as efficient as possible. The training process can still take many days even with multiple GPUs, so it is important to carefully log the results of each experiment and use that information to make informed decisions about how to improve the model.
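As a rough illustration of the synchronous data-parallel pattern this implies (each GPU works on a shard of the batch and the gradients are averaged before a single update), here is a toy NumPy sketch; it is not OpenAI's training code, and the linear model and shapes are made up.

```python
import numpy as np

def sharded_gradient_step(w, X, y, num_gpus=4, lr=1e-3):
    """Sketch of synchronous data parallelism: each GPU would receive one
    shard of the minibatch, compute gradients locally, and the averaged
    gradient drives a single parameter update. The Python loop stands in
    for the per-device work."""
    X_shards = np.array_split(X, num_gpus)
    y_shards = np.array_split(y, num_gpus)
    grads = [2.0 / len(xs) * xs.T @ (xs @ w - ys)       # per-device gradient
             for xs, ys in zip(X_shards, y_shards)]
    return w - lr * np.mean(grads, axis=0)              # all-reduce + update

# Toy usage with a linear least-squares model (shapes are illustrative).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(256, 10)), rng.normal(size=256)
w = sharded_gradient_step(np.zeros(10), X, y)
```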

While the results of the experiments described in the paper were good, they were not as good as the authors had hoped. This is not unusual in scientific research, as there are often many unknowns and unpredictable factors that can impact the outcome of an experiment. The authors continued to test different hypotheses and make adjustments to try to improve the performance of the model, but ultimately were not able to achieve the desired results.

Infrastructure:

Software

[Image: A sample of our TensorFlow code]

The majority of the research code used by OpenAI is written in Python, using libraries such as TensorFlow and Theano for GPU computing, and NumPy for CPU computing. Some researchers also use higher-level frameworks like Keras on top of TensorFlow.
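The original code screenshot is not reproduced here; purely as a generic illustration of that stack, and not OpenAI's actual code, a tiny Keras-on-TensorFlow model might look like this:

```python
import numpy as np
import tensorflow as tf

# Generic illustration only: a small Keras classifier defined on top of
# TensorFlow, with NumPy handling the CPU-side data preparation.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

x = np.random.rand(32, 784).astype("float32")   # fake minibatch of flattened images
y = np.random.randint(0, 10, size=32)           # fake labels
model.train_on_batch(x, y)
```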

Like many other researchers in the deep learning community, the authors use Python 2.7 and the Anaconda distribution, which includes convenient packaging for libraries such as OpenCV and offers performance optimizations for some scientific libraries.

Hardware

In an ideal situation, doubling the number of nodes in a computing cluster would halve the runtime of a batch job. In deep learning, however, the speedup from using multiple GPUs is often sublinear, so each individual GPU matters a great deal. To achieve the best performance, it pays to use the fastest GPUs available.
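A back-of-the-envelope illustration of why the speedup flattens (this is Amdahl's law applied generically, not a figure from the article): if only part of each training step parallelizes across GPUs, extra devices quickly hit diminishing returns.

```python
# If a fraction p of each training step parallelizes across GPUs, the
# best-case speedup on n GPUs is 1 / ((1 - p) + p / n), which flattens
# out well below n. The value p = 0.9 below is purely illustrative.
def speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

for n in (1, 2, 4, 8):
    print(n, round(speedup(0.9, n), 2))   # 1.0, 1.82, 3.08, 4.71
```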

In addition to using GPUs for deep learning, the authors also make use of CPUs for running simulators, reinforcement learning environments, and small-scale models that do not benefit from being run on a GPU. This highlights the importance of having a well-rounded and flexible computing infrastructure for deep learning research.

[Image: nvidia-smi showing fully-loaded Titan Xs]

OpenAI received a donation of compute resources from Amazon Web Services (AWS), which they use for CPU instances and for horizontally scaling up GPU jobs. They also run their own physical servers, primarily equipped with Titan X GPUs. OpenAI expect to use a hybrid cloud model for the long term, as it allows them to experiment with different GPUs, interconnects, and other technologies that may be important for the future of deep learning, and to stay at the forefront of the field.

[Image: htop on the same physical box showing plenty of spare CPU. We generally run our CPU-intensive workloads separately from our GPU-intensive ones.]

Provisioning

OpenAI treat their computing infrastructure as a product that must be user-friendly and easy to use. To achieve this, they use a consistent set of tools to manage all of their servers and ensure that they are configured as similarly as possible. This simplifies the process of setting up and running experiments, and allows researchers to focus on the science rather than the technical details of the infrastructure. By treating their infrastructure like a product, OpenAI are able to support the rapid, iterative experimentation that is essential for progress in deep learning.

[Image: Snippet of our Terraform config for managing Auto Scaling groups. Terraform creates, modifies, or destroys your running cloud resources to match your configuration files.]

OpenAI use Terraform to set up their cloud resources on AWS, including instances, network routes, and DNS records. The cloud and physical nodes run Ubuntu and are configured using Chef. To speed up the process of setting up new nodes, they use Packer to pre-bake cluster AMIs. The clusters use non-overlapping IP ranges and are interconnected over the public internet using OpenVPN on user laptops, and strongSwan on physical nodes.

OpenAI use a combination of NFS (on physical hardware), EFS, and S3 to store people's home directories, datasets, and results. This allows them to easily access and share data across their computing infrastructure.

Orchestration:

Scalable infrastructure can sometimes make simple tasks more difficult, so OpenAI put equal effort into developing tools and processes for small- and large-scale jobs. They are also working on making distributed use cases as accessible as local ones, to support the flexible and iterative experimentation that is essential for deep learning research.

To support ad-hoc experimentation, OpenAI provide a cluster of SSH nodes with and without GPUs. They also use Kubernetes as their cluster scheduler for physical and AWS nodes. The cluster spans three AWS regions to ensure that they have enough capacity to handle their workload.

Kubernetes requires that each job be packaged as a Docker container, which provides dependency isolation and code snapshotting. However, building a new Docker container can add extra time to a researcher's iteration cycle, so OpenAI also provide tools to ship code from a researcher's laptop into a standard image. This speeds up the experimentation process and allows researchers to focus on the science.
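A hypothetical helper in that spirit might simply wrap the researcher's code directory in a standard Dockerfile and push the resulting image; the base image and registry name below are made up for illustration, and this is not OpenAI's actual tooling.

```python
import pathlib
import subprocess

def ship(code_dir, image="registry.example.com/research/experiment:latest"):
    """Hypothetical sketch: write a standard Dockerfile around the local code,
    build the image, and push it, so researchers don't hand-craft a container
    for every experiment."""
    dockerfile = pathlib.Path(code_dir) / "Dockerfile"
    dockerfile.write_text(
        "FROM python:3.10-slim\n"       # assumed base image
        "COPY . /experiment\n"
        "WORKDIR /experiment\n"
        'CMD ["python", "train.py"]\n'  # assumed entry point
    )
    subprocess.run(["docker", "build", "-t", image, str(code_dir)], check=True)
    subprocess.run(["docker", "push", image], check=True)
    return image
```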

[Image: Model learning curves in TensorBoard]

OpenAI expose the cluster's flannel overlay network directly to researchers' laptops, allowing them to easily reach their running jobs. This is particularly useful for accessing monitoring services like TensorBoard. They initially used an approach that provided stricter isolation, but found that it added too much friction and made it difficult for researchers to access the services they needed. Exposing the flannel network directly provides a simpler and more user-friendly experience.

kubernetes-ec2-autoscaler

OpenAI have a workload that is bursty and unpredictable, with some experiments requiring many more resources than others. This makes it challenging to manage their computing infrastructure, as they need to be able to quickly provision new resources as needed.

To address this issue, OpenAI use Kubernetes to manage their cloud infrastructure, and use Auto Scaling groups to dynamically provision new Kubernetes nodes. However, they found it difficult to manage the size of the Auto Scaling groups correctly, as AWS's Scaling Policies did not always provide the level of control they needed. To solve this problem, OpenAI developed kubernetes-ec2-autoscaler, a batch-optimized scaling manager for Kubernetes that makes it easier to keep the Auto Scaling groups sized to the workload.

While it may be tempting to use raw EC2 for large batch jobs, the authors found that the Kubernetes ecosystem provided many benefits, such as low-friction tooling, logging, monitoring, and the ability to manage physical nodes separately from the running instances. As a result, they chose to focus on making Kubernetes autoscale correctly rather than rebuilding their infrastructure on raw EC2.

[Image: The Launch Configurations for our Kubernetes cluster]

The kubernetes-ec2-autoscaler works by polling the Kubernetes master's state to determine the current resource ask and capacity of the cluster. If there is excess capacity, it drains and ultimately terminates the relevant nodes. If more resources are needed, it calculates the number and type of servers that should be created and increases the size of the relevant Auto Scaling groups.
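A heavily simplified sketch of that reconciliation loop follows; the real kubernetes-ec2-autoscaler is considerably more involved. The Kubernetes-side queries (how many jobs are unschedulable, which nodes are idle) are assumed to be done by the caller, and only the boto3 Auto Scaling calls below are real AWS APIs.

```python
import boto3

# Simplified sketch of the scale-up / scale-down decision, not the real tool.
asg = boto3.client("autoscaling", region_name="us-west-2")

def reconcile(group_name, num_pending_jobs, idle_node_names,
              instances_per_pending_job=1):
    if num_pending_jobs > 0:
        # Grow the Auto Scaling group when the scheduler has work it cannot place.
        current = asg.describe_auto_scaling_groups(
            AutoScalingGroupNames=[group_name]
        )["AutoScalingGroups"][0]["DesiredCapacity"]
        asg.set_desired_capacity(
            AutoScalingGroupName=group_name,
            DesiredCapacity=current + num_pending_jobs * instances_per_pending_job,
        )
    else:
        # Otherwise, surplus nodes would be cordoned, drained, and terminated;
        # that Kubernetes-side logic is omitted here.
        for name in idle_node_names:
            print(f"would drain and terminate {name}")
```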

The autoscaler can handle multiple Auto Scaling groups, and can take into account resources beyond just CPU (such as memory and GPUs). It also supports fine-grained constraints on jobs, such as the AWS region and instance size. Additionally, it can handle overflow to a secondary AWS region in cases where the primary region reaches its capacity.

Overall, OpenAI's aim is to maximize the productivity of deep learning researchers by providing robust, user-friendly infrastructure. They continue to develop and improve their tools and workflows, and welcome contributions from others in the community.
