USENIX OpML '20 - Session 7 - Model Training
Join us for the OpML '20 session on operational machine learning, presented from the point of view of practitioners solving real problems. The session will be hosted on the USENIX OpML Slack workspace as an Ask-Me-Anything with the authors on Thursday, August 6, from 9:00 am to 10:30 am PDT. To participate, join the free Slack workspace above and go to the session channel!
Both the complexity of machine learning models and the raw amount of data on which they are trained are constantly increasing. In this session, presentations range from managing the Spark and Kubernetes infrastructure used to train models at Intuit, to Bayesian optimization approaches from SigOpt for continuously improving model training efficiency over time, to insights gleaned at Google from fifteen years of model training and model execution outages.
How ML Breaks: A Decade of Outages for One Large ML Pipeline
Daniel Papasian and Todd Underwood, Google
Reliable management of continuous or periodic machine learning pipelines at large scale presents significant operational challenges. Drawing on experience from almost 15 years of operating some of the largest ML pipelines, we examine the characteristics of one of the largest and oldest continuous pipelines at Google. We look at the actual outages it experienced and try to understand what caused them.
We examine failures in detail, categorizing them as ML vs. non-ML and distributed vs. non-distributed. We demonstrate that a majority of the outages are not ML-centric and are instead related to the distributed character of the pipeline.
SPOK - Managing ML/Big Data Spark Workloads at scale on Kubernetes
Nagaraj Janardhana and Mike Arov, Intuit
At Intuit, customer data sets are growing exponentially with the growth of the business and the capabilities offered. We built an elastic platform, SpoK (Spark on Kubernetes), to run Jupyter notebooks, data processing, feature engineering, distributed training jobs, batch model inference, and model evaluation workflows on Spark, using Kubernetes as the resource manager.
With the whole organization moving to Kubernetes for its service workloads, we saw an opportunity to run ML workloads on Kubernetes as well: it simplifies cluster operations, brings the benefits of containers to data processing, provides scalable infrastructure with cost and efficiency improvements, and lets us reuse the CI/CD and security certification tooling we had already built. The migration from EMR/YARN to Kubernetes has improved developer productivity by reducing deployment time from more than seven days to less than a day, delivered cost improvements in the range of 25-30%, and eased cluster operations management because all workload types share the same cluster.
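As a rough illustration of what running Spark with Kubernetes as the resource manager looks like, here is a minimal PySpark sketch. It is not Intuit's SpoK platform itself; the API server address, container image, and S3 path are hypothetical placeholders.

```python
# Minimal sketch: a PySpark job that uses Kubernetes (instead of YARN) as the
# resource manager. Image name, registry, and data path are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("feature-engineering-sketch")
    # Point Spark at the Kubernetes API server rather than a YARN resource manager.
    .master("k8s://https://kubernetes.default.svc:443")
    # Container image bundling Spark plus the job's dependencies (hypothetical).
    .config("spark.kubernetes.container.image", "example.registry/spark-ml:latest")
    # Number of executor pods Kubernetes should schedule for this job.
    .config("spark.executor.instances", "4")
    .getOrCreate()
)

# Hypothetical input path; any Spark transformation would run the same way.
df = spark.read.parquet("s3a://example-bucket/events/")
df.groupBy("event_type").count().show()

spark.stop()
```

Because the driver and executors run as ordinary pods, ML jobs can share the same cluster, container images, and CI/CD tooling as service workloads.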
Addressing Some of the Challenges When Optimizing Long-to-Train Models
Tobias Andreasen, SigOpt
As machine learning models become more complex and require longer training cycles, optimizing them to maximize performance can come to be seen as an intractable problem, which tends to leave a lot of performance unrealized.
The challenge is often that the most common methods for hyperparameter optimization are either sample efficient or able to parallelize efficiently, but not both. This forces a choice between a very long optimization process with good performance and a much shorter one with suboptimal performance.
A further challenge is justifying the cost of optimizing these long-to-train models, because in most situations the work has to be done on a per-model basis, with none of the information gained being leveraged in the future.
This talk outlines ways to address these challenges when bringing optimally performing models into production.
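To make the sample-efficiency side of the trade-off above concrete, here is a minimal sketch of sequential Bayesian optimization using the open-source scikit-optimize library; it is not SigOpt's product or API, and train_and_evaluate is a hypothetical stand-in for a long training run.

```python
# Minimal sketch: sample-efficient (sequential) hyperparameter search with
# Bayesian optimization via scikit-optimize. Each objective call stands in for
# an expensive training run, which is exactly why sample efficiency matters.
from skopt import gp_minimize
from skopt.space import Real, Integer

search_space = [
    Real(1e-5, 1e-1, prior="log-uniform", name="learning_rate"),
    Integer(16, 256, name="batch_size"),
]

def train_and_evaluate(learning_rate, batch_size):
    # Hypothetical stand-in for a long training-and-validation run; returns a
    # synthetic loss so the sketch runs end-to-end.
    return (learning_rate - 1e-3) ** 2 + abs(batch_size - 64) / 1000.0

def objective(params):
    learning_rate, batch_size = params
    return train_and_evaluate(learning_rate, batch_size)

# The Gaussian-process surrogate chooses each next point sequentially;
# n_calls bounds the number of expensive training runs.
result = gp_minimize(objective, search_space, n_calls=25, random_state=0)
print("best hyperparameters:", result.x, "best loss:", result.fun)
```

A sequential loop like this is sample efficient but hard to parallelize, whereas methods such as random search parallelize trivially but need many more trials; that tension is the trade-off the talk discusses.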
Please join us at the session!
Joel Young and Nisha Talagala, USENIX OpML '20 Co-Chairs