USENIX OpML '20 - Session 7 - Model Training
Join us for the OpML '20 session on operational machine learning, presented from the point of view of practitioners solving real problems. The session will be hosted on the USENIX OpML Slack workspace as an Ask-Me-Anything with the authors on Thursday, August 6, from 9:00 am to 10:30 am PDT. To participate, join the free Slack workspace above and go to the session channel!
Both the complexity of machine learning models and the raw amount of data on which they are trained are constantly increasing. In this session, presentations range from managing the Spark and Kubernetes infrastructure used to train models at Intuit, to Bayesian optimization approaches from SigOpt for continuously improving model training efficiency over time, to insights gleaned at Google from fifteen years of model training and model execution outages.
How ML Breaks: A Decade of Outages for One Large ML Pipeline
Daniel Papasian and Todd Underwood, Google
Reliable management of continuous or periodic machine learning pipelines at large scale presents significant operational challenges. Drawing on experience from almost 15 years of operating some of the largest ML pipelines, we examine the characteristics of one of the largest and oldest continuous pipelines at Google. We look at the actual outages it experienced and try to understand what caused them.
We examine failures in detail, categorizing them as ML vs. non-ML and distributed vs. non-distributed. We demonstrate that a majority of the outages are not ML-centric and are instead related to the distributed character of the pipeline.
SPOK - Managing ML/Big Data Spark Workloads at scale on Kubernetes
Nagaraj Janardhana and Mike Arov, Intuit
At Intuit, customer data sets are growing exponentially with the growth of the business and the capabilities offered. We built an elastic platform, SpoK (Spark on Kubernetes), to run Jupyter notebooks, data processing, feature engineering, distributed training jobs, batch model inference, and model evaluation workflows on Spark, using Kubernetes as the resource manager.
With the whole organization moving to Kubernetes for its service workloads, we saw an opportunity to run ML workloads on Kubernetes as well: it simplifies cluster operations, brings the benefits of containers to data processing, provides scalable infrastructure with cost and efficiency improvements, and lets us reuse the CI/CD and security certification tooling we had already built. The migration from EMR/YARN to Kubernetes has improved developer productivity by reducing deployment time from more than seven days to less than a day, delivered cost improvements in the range of 25-30%, and eased cluster operations management because all workload types share the same cluster.
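As a rough illustration of what running Spark with Kubernetes as the resource manager looks like, here is a minimal PySpark sketch. It is not Intuit's SpoK platform itself; the API server address, container image, and S3 path are hypothetical placeholders.

```python
# Minimal sketch: a PySpark job that uses Kubernetes (instead of YARN) as the
# resource manager. Image name, registry, and data path are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("feature-engineering-sketch")
    # Point Spark at the Kubernetes API server rather than a YARN resource manager.
    .master("k8s://https://kubernetes.default.svc:443")
    # Container image bundling Spark plus the job's dependencies (hypothetical).
    .config("spark.kubernetes.container.image", "example.registry/spark-ml:latest")
    # Number of executor pods Kubernetes should schedule for this job.
    .config("spark.executor.instances", "4")
    .getOrCreate()
)

# Hypothetical input path; any Spark transformation would run the same way.
df = spark.read.parquet("s3a://example-bucket/events/")
df.groupBy("event_type").count().show()

spark.stop()
```

Because the driver and executors run as ordinary pods, ML jobs can share the same cluster, container images, and CI/CD tooling as service workloads.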
Addressing Some of the Challenges When Optimizing Long-to-Train Models
Tobias Andreasen, SigOpt
As machine learning models become more complex and require longer training cycles, optimizing them to maximize performance can come to be seen as an intractable problem, which tends to leave a lot of performance unrealized.
The challenge is often that the most common methods for hyperparameter optimization are either sample efficient or able to parallelize efficiently, but not both. This forces a choice between a very long optimization process with good performance and a much shorter one with suboptimal performance.
A further challenge is justifying the cost of optimizing these long-to-train models, because in most situations the work has to be done on a per-model basis, with none of the information gained being leveraged in the future.
This talk outlines ways to address these challenges when bringing optimally performing models into production.
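To make the sample-efficiency side of the trade-off above concrete, here is a minimal sketch of sequential Bayesian optimization using the open-source scikit-optimize library; it is not SigOpt's product or API, and train_and_evaluate is a hypothetical stand-in for a long training run.

```python
# Minimal sketch: sample-efficient (sequential) hyperparameter search with
# Bayesian optimization via scikit-optimize. Each objective call stands in for
# an expensive training run, which is exactly why sample efficiency matters.
from skopt import gp_minimize
from skopt.space import Real, Integer

search_space = [
    Real(1e-5, 1e-1, prior="log-uniform", name="learning_rate"),
    Integer(16, 256, name="batch_size"),
]

def train_and_evaluate(learning_rate, batch_size):
    # Hypothetical stand-in for a long training-and-validation run; returns a
    # synthetic loss so the sketch runs end-to-end.
    return (learning_rate - 1e-3) ** 2 + abs(batch_size - 64) / 1000.0

def objective(params):
    learning_rate, batch_size = params
    return train_and_evaluate(learning_rate, batch_size)

# The Gaussian-process surrogate chooses each next point sequentially;
# n_calls bounds the number of expensive training runs.
result = gp_minimize(objective, search_space, n_calls=25, random_state=0)
print("best hyperparameters:", result.x, "best loss:", result.fun)
```

A sequential loop like this is sample efficient but hard to parallelize, whereas methods such as random search parallelize trivially but need many more trials; that tension is the trade-off the talk discusses.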
Please join us at the session!
Joel Young and Nisha Talagala, USENIX OpML '20 Co-Chairs