How to build your own deep learning infrastructure

Deep learning is a type of machine learning that uses large amounts of data and algorithms to build models that can learn and make predictions. Good infrastructure is crucial for success in deep learning, but thanks to open-source tools, anyone can build their own. This has made deep learning more accessible and has contributed to its recent rapid progress.

In this article, we’ll discuss how deep learning research usually proceeds, describe the infrastructure choices we have to make to support it, and open-source kubernetes-ec2-autoscaler, a batch-optimized scaling manager for Kubernetes. I hope you find this article useful in building your own deep learning infrastructure.

The use case:

A common process in deep learning involves starting with an idea and testing it on a small problem to see if it works. This allows researchers to quickly run a variety of experiments to see which approaches are most effective. To do this efficiently, they need to be able to easily access computing resources and run their experiments without encountering barriers.

Once a deep learning model has been tested and shown to be effective on a small scale, the next step is to try to improve it by pushing it to its limits and finding ways to overcome its limitations. This involves running the model repeatedly to see how it behaves under different conditions, and then making changes and adjustments to improve its performance. This process is similar to building any other type of software system, and it requires iterative experimentation and testing to identify and fix any issues.

Deep learning infrastructure must enable users to easily inspect and analyze their models, in order to understand how they are working and identify any potential issues. Providing only summary statistics is not sufficient, as it does not give users the detailed insights they need to improve their models.

Once a deep learning model has shown promise on a small scale, the next step is to scale it up to larger datasets and more computing resources. This typically involves running longer, more complex experiments that can take multiple days to complete. Careful experiment management and thoughtful hyperparameter selection are crucial in this phase, as they can significantly impact the quality of the results.

The early research process in deep learning is often unstructured and rapid, while the later stages are more methodical and time-consuming. However, both stages are important for achieving a great result.

An example:

The paper "Improved Techniques for Training GANs" describes several techniques for improving the training of generative adversarial networks (GANs), which are a type of machine learning model that involves two competing neural networks. The generator network tries to create fake data that is similar to the real data, while the discriminator network tries to identify which data is real and which is fake. A successful generator network is one that can consistently fool the discriminator network.

However, GANs have a potential failure mode known as "collapse," where the generator network always outputs the same sample, even if it is a realistic-looking one. The paper discusses a technique for addressing this issue and improving the performance of GANs.

The "Improved Techniques for Training GANs" paper describes a technique for addressing the issue of GAN collapse, where the generator network always outputs the same sample. The technique involves giving the discriminator network an entire minibatch of samples as input, rather than just a single sample. This allows the discriminator to detect when the generator is producing the same sample over and over, and sends gradients back to the generator to correct the problem.

The technique was initially tested on the MNIST and CIFAR-10 datasets, which allowed for rapid prototyping and iteration. The results on CIFAR-10 were particularly promising, producing some of the best samples the authors had seen on that dataset.

However, to be truly useful, deep learning algorithms need to scale up to larger datasets such as ImageNet. Ian Goodfellow, one of the paper's authors, then focused on scaling the model up to work on ImageNet.

[Image: OpenAI model learning to generate ImageNet images]

To train a deep learning model on a large dataset like ImageNet, it is necessary to use multiple GPUs in parallel. This allows the model to process more data faster, but also requires careful experiment management to ensure that each experiment is as efficient as possible. The training process can still take many days even with multiple GPUs, so it is important to carefully log the results of each experiment and use that information to make informed decisions about how to improve the model.
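As a rough illustration of the synchronous data-parallel pattern this implies (each GPU works on a shard of the batch and the gradients are averaged before a single update), here is a toy NumPy sketch; it is not OpenAI's training code, and the linear model and shapes are made up.

```python
import numpy as np

def sharded_gradient_step(w, X, y, num_gpus=4, lr=1e-3):
    """Sketch of synchronous data parallelism: each GPU would receive one
    shard of the minibatch, compute gradients locally, and the averaged
    gradient drives a single parameter update. The Python loop stands in
    for the per-device work."""
    X_shards = np.array_split(X, num_gpus)
    y_shards = np.array_split(y, num_gpus)
    grads = [2.0 / len(xs) * xs.T @ (xs @ w - ys)       # per-device gradient
             for xs, ys in zip(X_shards, y_shards)]
    return w - lr * np.mean(grads, axis=0)              # all-reduce + update

# Toy usage with a linear least-squares model (shapes are illustrative).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(256, 10)), rng.normal(size=256)
w = sharded_gradient_step(np.zeros(10), X, y)
```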

While the results of the experiments described in the paper were good, they were not as good as the authors had hoped. This is not unusual in scientific research, as there are often many unknowns and unpredictable factors that can impact the outcome of an experiment. The authors continued to test different hypotheses and make adjustments to try to improve the performance of the model, but ultimately were not able to achieve the desired results.

Infrastructure:

Software

[Image: A sample of our TensorFlow code]

The majority of the research code used by OpenAI is written in Python, using libraries such as TensorFlow and Theano for GPU computing, and NumPy for CPU computing. Some researchers also use higher-level frameworks like Keras on top of TensorFlow.
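The original code screenshot is not reproduced here; purely as a generic illustration of that stack, and not OpenAI's actual code, a tiny Keras-on-TensorFlow model might look like this:

```python
import numpy as np
import tensorflow as tf

# Generic illustration only: a small Keras classifier defined on top of
# TensorFlow, with NumPy handling the CPU-side data preparation.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

x = np.random.rand(32, 784).astype("float32")   # fake minibatch of flattened images
y = np.random.randint(0, 10, size=32)           # fake labels
model.train_on_batch(x, y)
```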

Like many other researchers in the deep learning community, the authors use Python 2.7 and the Anaconda distribution, which includes convenient packaging for libraries such as OpenCV and offers performance optimizations for some scientific libraries.

Hardware

In an ideal situation, doubling the number of nodes in a computing cluster would halve the runtime of a batch job. In deep learning, however, the speedup from using multiple GPUs is often sublinear, so each individual GPU matters a great deal. To achieve the best performance, it pays to use the fastest GPUs available.
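A back-of-the-envelope illustration of why the speedup flattens (this is Amdahl's law applied generically, not a figure from the article): if only part of each training step parallelizes across GPUs, extra devices quickly hit diminishing returns.

```python
# If a fraction p of each training step parallelizes across GPUs, the
# best-case speedup on n GPUs is 1 / ((1 - p) + p / n), which flattens
# out well below n. The value p = 0.9 below is purely illustrative.
def speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

for n in (1, 2, 4, 8):
    print(n, round(speedup(0.9, n), 2))   # 1.0, 1.82, 3.08, 4.71
```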

In addition to using GPUs for deep learning, the authors also make use of CPUs for running simulators, reinforcement learning environments, and small-scale models that do not benefit from being run on a GPU. This highlights the importance of having a well-rounded and flexible computing infrastructure for deep learning research.

[Image: nvidia-smi showing fully-loaded Titan Xs]

OpenAI received a donation of compute resources from Amazon Web Services (AWS), which they use for CPU instances and for horizontally scaling up GPU jobs. They also run their own physical servers, primarily equipped with Titan X GPUs. OpenAI expect to use a hybrid cloud model for the long term, as it allows them to experiment with different GPUs, interconnects, and other technologies that may be important for the future of deep learning, and to stay at the forefront of the field.

[Image: htop on the same physical box showing plenty of spare CPU. We generally run our CPU-intensive workloads separately from our GPU-intensive ones.]

Provisioning

OpenAI treat their computing infrastructure as a product that must be user-friendly and easy to use. To achieve this, they use a consistent set of tools to manage all of their servers and ensure that they are configured as similarly as possible. This simplifies the process of setting up and running experiments, and allows researchers to focus on the science rather than the technical details of the infrastructure. By treating their infrastructure like a product, OpenAI are able to support the rapid, iterative experimentation that is essential for progress in deep learning.

[Image: Snippet of our Terraform config for managing Auto Scaling groups. Terraform creates, modifies, or destroys your running cloud resources to match your configuration files.]

OpenAI use Terraform to set up their cloud resources on AWS, including instances, network routes, and DNS records. The cloud and physical nodes run Ubuntu and are configured using Chef. To speed up the process of setting up new nodes, they use Packer to pre-bake cluster AMIs. The clusters use non-overlapping IP ranges and are interconnected over the public internet using OpenVPN on user laptops, and strongSwan on physical nodes.

OpenAI use a combination of NFS (on physical hardware), EFS, and S3 to store people's home directories, datasets, and results. This allows them to easily access and share data across their computing infrastructure.

Orchestration:

Scalable infrastructure can sometimes make simple tasks more difficult, so OpenAI put equal effort into developing tools and processes for small- and large-scale jobs. They are also working on making distributed use cases as accessible as local ones, to support the flexible and iterative experimentation that is essential for deep learning research.

To support ad-hoc experimentation, OpenAI provide a cluster of SSH nodes with and without GPUs. They also use Kubernetes as their cluster scheduler for physical and AWS nodes. The cluster spans three AWS regions to ensure that they have enough capacity to handle their workload.

Kubernetes requires that each job be packaged as a Docker container, which provides dependency isolation and code snapshotting. However, building a new Docker container can add extra time to a researcher's iteration cycle, so OpenAI also provide tools to ship code from a researcher's laptop into a standard image. This speeds up the experimentation process and allows researchers to focus on the science.
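A hypothetical helper in that spirit might simply wrap the researcher's code directory in a standard Dockerfile and push the resulting image; the base image and registry name below are made up for illustration, and this is not OpenAI's actual tooling.

```python
import pathlib
import subprocess

def ship(code_dir, image="registry.example.com/research/experiment:latest"):
    """Hypothetical sketch: write a standard Dockerfile around the local code,
    build the image, and push it, so researchers don't hand-craft a container
    for every experiment."""
    dockerfile = pathlib.Path(code_dir) / "Dockerfile"
    dockerfile.write_text(
        "FROM python:3.10-slim\n"       # assumed base image
        "COPY . /experiment\n"
        "WORKDIR /experiment\n"
        'CMD ["python", "train.py"]\n'  # assumed entry point
    )
    subprocess.run(["docker", "build", "-t", image, str(code_dir)], check=True)
    subprocess.run(["docker", "push", image], check=True)
    return image
```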

[Image: Model learning curves in TensorBoard]

OpenAI expose the cluster's flannel overlay network directly to researchers' laptops, allowing them to easily reach their running jobs. This is particularly useful for accessing monitoring services like TensorBoard. They initially used an approach that provided stricter isolation, but found that it added too much friction and made it difficult for researchers to access the services they needed. Exposing the flannel network directly provides a simpler and more user-friendly experience.

kubernetes-ec2-autoscaler

OpenAI have a workload that is bursty and unpredictable, with some experiments requiring many more resources than others. This makes it challenging to manage their computing infrastructure, as they need to be able to quickly provision new resources as needed.

To address this issue, OpenAI use Kubernetes to manage their cloud infrastructure, and use Auto Scaling groups to dynamically provision new Kubernetes nodes. However, they found it difficult to manage the size of the Auto Scaling groups correctly, as AWS's Scaling Policies did not always provide the level of control they needed. To solve this problem, OpenAI developed kubernetes-ec2-autoscaler, a batch-optimized scaling manager for Kubernetes that makes it easier to keep the Auto Scaling groups sized to the workload.

While it may be tempting to use raw EC2 for large batch jobs, the authors found that the Kubernetes ecosystem provided many benefits, such as low-friction tooling, logging, monitoring, and the ability to manage physical nodes separately from the running instances. As a result, they chose to focus on making Kubernetes autoscale correctly rather than rebuilding their infrastructure on raw EC2.

[Image: The Launch Configurations for our Kubernetes cluster]

The kubernetes-ec2-autoscaler works by polling the Kubernetes master's state to determine the current resource ask and capacity of the cluster. If there is excess capacity, it drains and ultimately terminates the relevant nodes. If more resources are needed, it calculates the number and type of servers that should be created and increases the size of the relevant Auto Scaling groups.
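A heavily simplified sketch of that reconciliation loop follows; the real kubernetes-ec2-autoscaler is considerably more involved. The Kubernetes-side queries (how many jobs are unschedulable, which nodes are idle) are assumed to be done by the caller, and only the boto3 Auto Scaling calls below are real AWS APIs.

```python
import boto3

# Simplified sketch of the scale-up / scale-down decision, not the real tool.
asg = boto3.client("autoscaling", region_name="us-west-2")

def reconcile(group_name, num_pending_jobs, idle_node_names,
              instances_per_pending_job=1):
    if num_pending_jobs > 0:
        # Grow the Auto Scaling group when the scheduler has work it cannot place.
        current = asg.describe_auto_scaling_groups(
            AutoScalingGroupNames=[group_name]
        )["AutoScalingGroups"][0]["DesiredCapacity"]
        asg.set_desired_capacity(
            AutoScalingGroupName=group_name,
            DesiredCapacity=current + num_pending_jobs * instances_per_pending_job,
        )
    else:
        # Otherwise, surplus nodes would be cordoned, drained, and terminated;
        # that Kubernetes-side logic is omitted here.
        for name in idle_node_names:
            print(f"would drain and terminate {name}")
```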

The autoscaler can handle multiple Auto Scaling groups, and can take into account resources beyond just CPU (such as memory and GPUs). It also supports fine-grained constraints on jobs, such as the AWS region and instance size. Additionally, it can handle overflow to a secondary AWS region in cases where the primary region reaches its capacity.

Overall, OpenAI's aim is to maximize the productivity of deep learning researchers by providing robust, user-friendly infrastructure. They continue to develop and improve their tools and workflows, and welcome contributions from others in the community.
