How the Experts Do It: Production ML at Scale
R. Hoe & Co.'s sextuple stereotype perfecting printing press, with folders. (British Museum, HMNTS 10413.dd.1.)


Machine learning is driving virtually every major online service we use. In this panel, top experts from across the industry will discuss how they have learned to scale machine learning and its use in solving real-world problems. Come and learn strategies for managing the fast evolution of technologies; get insights into how deep learning is changing the serving game, from productionizing large models to using GPUs; learn how these companies keep their incredibly complicated serving stacks operable 24x7; and hear about the dimensions of scale that they worry about -- dimensions ranging from the raw queries per second flowing through their systems and the growing size and complexity of the models to the number of users across engineering who are building and fielding them. We'll conclude with a discussion of how these experts measure success.

USENIX OpML 2019

On 20 May 2019 in Santa Clara, California, USENIX hosted OpML '19, the 2019 USENIX Conference on Operational Machine Learning. The conference focused on what it takes to build and operate the machinery behind successful production-grade machine learning. In their words: "research advances and cutting edge solutions to the pervasive challenges of ML production lifecycle management." It featured a bunch of great talks, including an exciting keynote by Michael Jordan on "Ray: A Distributed Framework for Emerging AI Applications."

I, along with five of my industry colleagues, had the great honor of participating in a panel discussing our approaches to Operational Machine Learning at massive scale. Below you will find a video of our discussion and a complete transcript of the hour-long session. Please note that the grammar is what real speaking looks like when experts are thinking deeply while sharing their thoughts. The text has only been corrected to remove verbal tics and to add punctuation. I have taken the liberty of emphasizing (bold, italics) key points from each panelist. Also, each speaker transition is marked with the offset into the video and the speaker's first name so you can index into the video.

The Panelists

Host: Nisha Talagala

Moderator: Joel Young, LinkedIn

Panelists: Pranav (Google AI), Andrew Hoh (Airbnb), Faisal (Netflix), Sandhya Ramu (LinkedIn), and Aditya Kalro (Facebook)

Videographer: Yazhou C.

The Panel

Introduction

Natural, ungrammatical speaking begins here!

00:00 - Nisha

The panel will be led and moderated by Joel Young of LinkedIn! Welcome!

00:09 - Joel

Thank you, Nisha! I'm Joel -- I lead the Machine Inference Infrastructure team at LinkedIn. You hear a lot of discussions about machine learning. My team doesn't do machine learning. We do machine doing. We have plenty of people doing the machine learning. I am also part of the leadership team for LinkedIn's Productive Machine Learning initiative, rebuilding our entire machine learning stack.

I am joined today by five awesome leaders from across the industry. Please introduce yourselves:

00:47 - Pranav

I am Pranav from Google AI. I lead the teams working on conversational AI and personalization AI technologies. We are a group of scientists and engineers -- about fifty of us.

We do both the research and then develop the vertical specific infrastructure for these domains. These technologies help power various products including Android, Search and Chrome.

01:12 - Andrew Hoh

The ML infrastructure team builds foundational infrastructure for ML use cases throughout Airbnb. And the applied ML team looks to build foundational models that can be leveraged throughout Airbnb, sometimes, as well, the foundational features.

01:26 - Faisal

Hi! I'm Faisal. I lead personalization infrastructure at Netflix. My team is part of a larger applied machine learning org focused on one core element of the Netflix product, which is personalization, search, and messaging -- so content discovery -- for our members.

My team is sort of a localized-centralized team in the sense that we are focused on one large vertical, but, within that, across various focus areas we look for horizontal leverage so we can benefit from the diversity and the commonality between various use cases.

Happy to be here.

02:06 - Sandhya

I'm Sandhya Ramu. I lead the offline data and AI SRE teams at LinkedIn. My team is responsible for the grid infrastructure, like Hadoop, Spark, Presto, and so on, but also the data movement, both egress and ingress, from all different sources. Also the AI platforms -- the ones that are online serving and member facing -- such as search and cloud and so on and so forth.

So excited to be here and thanks for having me.

02:47 - Aditya

I’m Aditya Kalro. I’m an Engineering Manager of the AI Infrastructure team at Facebook. And we build a bunch of the infrastructure and tooling to make machine learning engineers more productive -- everything from feature engineering, data pre-processing, training, evaluation, and inference cloud so that you can push your models to production more quickly.

We also have a layer on top of all of this to glue the entire process together and to track lineage across these pieces so that you can actually tell which data a model was trained on. This becomes very useful -- I'm pretty sure we're gonna end up talking about some of this.

I’m excited to be here. Thank you for having me!

03:17 - Joel

Thank you all.

So the way we're gonna structure this is we're going to talk for a chunk of time, and then we have about 10 to 15 minutes for questions from all y'all. So if something comes to mind, maybe take a quick note, and we're gonna have a big chunk of time to talk about it.

As you think about things, we've got a few different kinds of people on the panel: 

Andrew is a product manager, so as you're thinking about things, if you have questions like "How do you think about things from the product perspective versus the engineer, builder, etc. perspective?" please put them on the table.

Sandhya is our only SRE representative. I know that she and her team have saved me -- or more accurately saved my customers -- many times when we make mistakes. Keep that in mind also as you're thinking about the questions and perspectives that you want to dig into later on.

So, speaking of that. 

Who are our customers?

One of the key questions that drives how we build, what we build, what our job is, building out the ML infrastructure and keeping it operable is exactly who are our customers. Thinking about whether they're ML engineers, whether they're data scientists, whether they're all the engineers in the company. Who you're building for drives what you build.

Andrew, as a product manager how do you think about this? Who are your customers?

04:58 - Andrew

Great question! 

So at Airbnb we really have many different personas that we look for -- who are our customers from ML infrastructure side. We have the ML engineers: these are usually people who can build their own infrastructure, who understand how to build applications well, understand what it really takes through the entire software development lifecycle. 

We also have data scientists. This, I know, is a term that means something different depending on which company you work for. For data scientists at Airbnb it's less code-based and more towards doing research and developing a lot of different models, and it spans from scientists who focus on analytics -- building dashboards -- to other data scientists who build up models, and a lot of these models could eventually be used in production.

And then the last one that really comes up which isn't talked [about] as much is the application engineers as well.

And so, for our infrastructure a lot of the core users would be the ML engineers and data scientists. Data scientists develop a lot of models and ML engineers may sometimes develop models but they also do some of the work to implement it back into the infrastructure or the production service. If a data scientist wants to pursue building a model they actually pair up with an application engineer who can then take the results of the model and then plug it back into the system.

So we find that the number of users and the archetypes of different users for our ML platform really varies. And so, we always have to consider this when we are building out new features so that we can make sure that each different persona is well thought out and make sure that they all have a great experience.

06:49 - Joel

Faisal?

06:51 - Faisal

A lot of what you've said resonates with me.

Similarly we have research scientists as well as the application engineers, the ML engineers. One thing that is slightly different in the Netflix model, at least in the vertical that I focus on, is there's sort of a hybrid that we call the algorithm engineers, where they have the chops of research scientists -- they understand the math and the statistics behind developing models -- but they're also trained in productizing things.

So having them as our customers informs the kind of frameworks and libraries that we build. And so, for instance, things like good software engineering design and API extensibility are important, because we are essentially targeting a set of customers who are equal parts software engineer and research scientist. So that helps.

07:50 - Joel

Awesome! Aditya, at Facebook, who are you building for?

07:52 - Aditya

That depends on which team you are on. What ends up happening is that we have a bunch of things that we're making -- that we're tying together. They make it very easy to go from one stage to the next.

I think what Andrew was talking about on the research-to-production side is exactly the kind of thing that we want to do with Facebook AI, but more from the product point of view. We're trying to make it easy so it's point-and-click. For training the model, you point it to your data and it will train the model based on the research the field has already done. And again, it's a one-button push to the inference system.

It doesn't always work like that. But, most of the time, if you're doing something inside a closed-box use case, it works. Application engineers typically find it a lot easier to get from one stage to the next stage if we provide the right tools.

So what you were talking about, we've talked about Pro-ML in different scenarios but that's what we're aiming for as well.

08:52 - Joel

Awesome. Thank you!

The SRE Perspective and Operability

Sandhya is here on the panel. So, Sandhya, as we build this stuff what do you wish that the four of us -- the five of us builders actually would keep in mind? What is important when keeping the systems operable?

09:15 - Sandhya

Yeah. That’s a good question. There's a long list but let’s whittle it down to a few.

The first one would be understanding the decision point when you will move from experimentation to production, because oftentimes there's a long lead time before we can take things to production. Working with the platform owners early on in the development lifecycle to give us enough lead time, as well as a heads-up for planning things around infrastructure, is definitely one thing.

The second thing would be: oftentimes what happens is people build flows and put things in production but they don't go back and retire their flows, either because the product priorities changed or people have moved on to different teams. There's a ton of stuff lying around in production which may not necessarily be adding value, for various reasons. I think having the craftsmanship initiative to go back and revisit some of these things, confirm the need for having them in production, and be proactive about getting rid of things that are no longer necessary is the other thing I would highlight.

The third thing is around capacity management in general. It's a joint ownership between platform owner and customer. I think there's a ton of metrics that make it visible not only from a cost-to-serve standpoint but also from an efficiency standpoint. For the platform users, how do we ensure there is accountability as a joint team and keep utilization at the most optimal level?

I think those are probably the three things I would highlight. And I aspire that all the operability aspects are built in early on.

11:19 - Joel

Thank you!

11:20 - Faisal

If I may just jump in to that.

I think the last point you mentioned resonates a lot with me. There's joint ownership of cost, and obviously there are no infinite resources, but oftentimes we run into challenges where research scientists and models are hungry for data and therefore compute power, and so they often want to be able to do the most -- the most with the most resources possible.

But when you talk to the product managers, when you talk to the research scientists, we often find that not everything has to be at the same high priority. For instance, if you're trying to do a backfill of a new model that's just early exploration, it's okay to let the backfill run for five days as opposed to one day if it's a lot of data. And that would make a substantial difference in the amount of resources that it may use.

So having those conversations oftentimes helps, and if you can have good metrics and dashboards to point out why, some of the leaderboards or shame boards are helpful.

12:25 - Joel

Nice. Pranav, can you talk a little bit about your team's perspective on operability?

12:29 - Pranav

I think operability is, especially when it comes to ML systems, important because people expect these systems to deliver results that are both explainable and predictable. By people I mean both the engineers that are using this infrastructure and the end users. And it really becomes important that we support it in a way where we can repeat those results predictably and with the quality that the end users are expecting.

13:11 - Joel

Great. Thank you.

Containerization and Deep Learning

One of the things that is transitioning in this industry, and different companies are in different places in the transition, is the role of containerization and the role of deep learning in what we build. And, interestingly, with things like Kubeflow and stuff like that, these two are not independent of each other.

So in this question, Aditya, can you talk a little bit about how this is impacting what you are building?

13:46 - Aditya

So, at Facebook we used homegrown containers, and one of the things that did happen when we went from a non-containerized system to a containerized system was that reliability improved and scalability improved. We could end up co-locating a lot of the work that we were doing across the entire fleet of machines we had.

In general, the reason that we used them -- or the reason that we told everybody that we would -- is because we wanted more security. We wanted to be able to make sure that, one, we couldn't get resources outside of what was supposed to be allocated, but also that we couldn't get data outside of what was supposed to be available. And that was one of the major reasons for containerization. It's helped us on a variety of levels, including exposing new types of hardware to models -- sorry -- to training right now. And we're expecting to continue on that path.

We use different containerization mechanisms for inference and for training. I'm hoping that as our containerization evolves and catches up with Kubernetes and Kubeflow, we'll actually be able to use the rest of the benefits that containerization can provide. But I feel like we're still a little further behind in the game because we're using our own codebase.

15:12 - Joel

Cool. My understanding is that in FBLearner Flow you're heavily engaged with Caffe2 and PyTorch, of course.

15:21 - Aditya

Yes!

15:22 - Joel

How do these models flow into production? Is it using a containerization framework, or?

15:26 - Aditya

Yes it is. It's actually using a packaging and containerization framework.  What ends up happening, because of the way that Facebook is set up, when Flow, when FBLearner Flow actually tries to create a mechanism for training, it builds the package with Caffe2 or PyTorch along with whatever other code there is to actually do the training. And then we ship that out. The one advantage that we see in this is that we don't have to deal that much with versioning but the one disadvantage that we see with it is that the container size gets very very very large.

So, there are advantages and disadvantages to both of them. And we chose one to make our life easier so we can actually move faster on shipping changes to PyTorch and Caffe2.

16:12 - Joel

Thank You!

Pranav, how are you guys thinking about it at Google?

16:15 - Pranav

Yeah. So at Google, internally, we use this system called Borg, which inspired Kubernetes, which is widely used both in Google Cloud and beyond. For ML specifically, I think containerization or something [like] Kubernetes is really important because ML, as we know, tends to be very resource intensive -- especially the research side of ML -- and that means resource needs aren't very predictable. Like if I do some training research, it might happen in bursts over a few days as opposed to consistently over weeks or months. And you need to be able to support that in a fashion where you can use those resources on demand.

I think containerization or something like Kubernetes makes it much easier. You can tap into the residual resources across lots of data centers and lots of machines. And I think that has significantly increased the amount of ML research and development that we see -- as opposed to the old model of dedicated machines and stuff like that.

17:31 - Joel

Yes. One of the things that I found kind of fascinating is sometimes there's this feedback loop between what we need for ML infra and deployability and the technology stack. And in some companies, it seems that containerization across the company is driving how we think about our ML infra. In other circumstances, like for example at LinkedIn with Kubernetes, it's what ML infra needs that's driving where we're going with the containerization technologies.

Sandhya, can you talk a little bit about it?

18:11 - Sandhya

Yeah, absolutely. I think, like Pranav mentioned earlier, there's a big demand for containerization, primarily because the older way of waiting for dedicated clusters, for your own queue, won't scale for these use cases. And it is not agile enough to turn around quickly.

I think having a bunch of Kubernetes clusters, both for CPUs and GPUs, is the way to go, where we quickly spin up containers for experimentation, which goes really fast. And it also gives a lot of autonomy to the ML developers so they can iterate faster rather than waiting for some of these libraries to be available in the larger clusters -- which takes much longer to go through the lifecycle of rolling out to production.

That's the primary driver at LinkedIn. We are still tapping into it -- very initial phases -- but we have a long way to go before we can make it accessible to every single developer out there who uses machine learning or aspires to do it. And I think Pro-ML is also another huge initiative that is pushing this across the company, not just the AI team at LinkedIn, so I am looking forward to progress on that front.

19:32 - Joel

Thank you. 

Cost-to-serve vs. agility

Early on in my time building the infra, we had this ten years of logistic regression running every recommender system and, maybe as a result of this, our systems had become extremely tuned. Some of the teams were even going so far as taking their models apart, looking at the inner product that's used to drive logistic regression, doing partial-sum caching on the components, and separately applying the link function to the scoring, in order to squeeze out every last drop of cost-to-serve.

This was awesome, really really efficient, and completely not agile. It would take quarters to change anything in how the models work. Because it had to be worked down to the lowest layers of the infra.
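
To make the partial-sum trick concrete, here is a minimal sketch of the idea, assuming the feature vector splits into a cacheable member block and a per-request item block. The class, the feature names, and the split are illustrative assumptions, not LinkedIn's actual serving code.

```python
import math

def sigmoid(z):
    # Logistic link function, applied only after the full inner product is assembled.
    return 1.0 / (1.0 + math.exp(-z))

class PartialSumScorer:
    """Cache the member-side portion of a logistic regression inner product so
    per-request scoring only pays for the item-side features."""

    def __init__(self, member_weights, item_weights):
        self.member_weights = member_weights   # feature name -> weight
        self.item_weights = item_weights
        self._member_cache = {}                # member id -> cached partial sum

    @staticmethod
    def _dot(weights, features):
        return sum(weights.get(name, 0.0) * value for name, value in features.items())

    def score(self, member_id, member_features, item_features):
        # Reuse the cached member partial sum when we have it.
        if member_id not in self._member_cache:
            self._member_cache[member_id] = self._dot(self.member_weights, member_features)
        z = self._member_cache[member_id] + self._dot(self.item_weights, item_features)
        return sigmoid(z)
```

The point of the split is that the member partial sum is computed once and reused across the many items scored in a single request; only the item-side dot product and the final link function are paid per item.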

Andrew, as you're looking at designing the products, how do you think about balancing things like cost-to-run, cost-to-serve, cost-to-train, versus agility for your customers?

20:37 - Andrew

A great question. I think the key thing, and how we've kind of wrapped our heads around cost, is really building the visibility for our downstream users. It's tough for infrastructure teams to be the owners of cost when it's hard to have visibility into all the downstream use cases using our infrastructure. And so, it comes down to building a lot of trust with users and allowing them to have visibility into how much cost they're incurring, and making sure that they're doing the cost-benefit analysis to confirm that the models they run are actually beneficial to the company.

One thing that really stuck with me from a co-worker of mine: He mentioned that one of the key things with cost is always baselining first. The example he talks about is, before he joined Airbnb, he was at this company and they invested in building a neural network for their search results, and after investing about six months to a year they actually found out that a simple popularity sort worked a lot better! And so you can imagine the overall cost of running this, the overall cost of training this, the overall cost for inference -- nothing even close to that baseline model -- not even a model -- just a baseline popularity sort.

I genuinely, from the Airbnb standpoint, recommend to baseline first, understand where you can get with something simple, and then, from there, iterate on it and see how much better you can make it, and then do the analysis: if we were to make its precision 5 percentage points better, how much more does it deliver to the company? Is it worth it? Especially in ML, where more complex and heavier models take a lot of resources, you really do have to take a look to see if increasing it by 0.5% is worth it or not and do that kind of cost-benefit analysis.

I think, long story short, we as infrastructure try to make it very very clear to our users just what the costs are for their model. And then the shared cost of, let's say, capacity -- of having capacity that's unused -- we build into how we manage cost for our team, and we try to verify that we actually need that staging cluster / test cluster to make sure that there are no production issues. It's a complex formula.

This is kind of how we ended up today. I imagine it'll change in another three to six months as we figure out the other costs that come into play.

23:08 - Joel

Sandhya, did you want to talk about the LinkedIn perspective?

23:10 - Sandhya

Sure. Similar to what Andrew was mentioning, I think, as platform owners, it is very hard for us to be responsible for the cost, primarily because of understanding the business needs of the various customers within the company -- all the way from sales, bizops, data science, and AI teams. Each of those teams has a different use case, and I think their needs are very different and need different speeds of delivery. For instance, at LinkedIn at the moment, our sales insights teams aren't looking to leverage GPUs because they don't need them now; but I'm fully aware that they will get there at some point. Planning ahead.

The other aspect is again making both costs as well as efficiency transparent to them. That way, there is a combined ownership and we can go back and drive accountability for the individual organizations. Specifically with AI teams, at the moment today, we do struggle a little bit because AI teams wanna move faster and they want things now; however, there are business costs and lead times associated, so we try to meet them in the middle. How do we cater to their needs and provide business justifications to our production infrastructure engineering teams? Working with the vendors proactively to make sure they can plan ahead around when we need some of these delivered, for quicker turnaround.

Also exploring the Azure cloud: for the non-standard SKUs, can we leverage the cloud rather than relying on on-prem, which takes much longer? Again it comes down to this (though we need to be agile): if we throw GPUs and CPUs at a problem when it is not necessary, we will quickly manufacture scale problems. Being aware that cost-to-serve is a very essential component of agility, and working towards balancing both, is the way to go. That's what we aspire to do at LinkedIn.

25:33 - Joel

Thank you.

25:35 - Andrew

So, if I can add on -- I forgot to answer the second part of the question. One thing that we try to do in infrastructure is keep things extremely flexible, especially for ML infrastructure, and this is where Joel kind of mentioned "do you fine-tune? But now you're paying the long-term cost of changing really fine-tuned infrastructure, which can take six months or longer." So, at Airbnb we're really taking an approach of being incredibly flexible at this time, because the ML industry and the state of the art are pretty dynamic right now. We find that there are a lot of different frameworks out there -- there's new hardware being applied as well, whether it's TPUs -- and so overall our current approach is keeping things very very flexible.

If the ML industry does become a little bit more stagnant and static, and the methodologies start consolidating into, let's say, a single ML framework or a specific type of hardware, and the innovation is slowing down, then Airbnb will take a stance of building in or fine-tuning the models and fine-tuning everything for cost optimization; but, currently, we keep it very flexible. We leverage a lot of Docker containers to make sure that people can use the libraries they want to. We're agnostic to ML frameworks -- we don't really impose whether they should go with TensorFlow, or PyTorch, or MXNet. We allow our users to use what they would like to use -- just because Airbnb is also full of engineers from many different companies who are used to different types of frameworks, and it's always hard to force everyone to use a single framework -- that's always a losing battle.

So, making sure that it's highly flexible and allows users to use what they know best has been really successful for us, and that's kind of the foundation for our infra.

27:19 - Aditya

Absolutely. When we started, if you don't know, that was exactly the approach that we took. And, going back to the containers thing, that is why having a flexible mechanism letting you stuff whatever you want in the containers makes it very easy for us to add the frameworks people want to use.

Sharing ML Artifacts

27:37 - Joel

One of the things that is really popular now in a lot of the literature, and talked about a lot at conferences, is deep embeddings. Another thing people are talking about is transfer learning. Now both of these come with the idea that they are leverageable by other teams. In the software world we've figured out some rules of thumb -- some engineering rules. We have things like semver with major numbers, minor numbers, and patch versions, and if I change my ABI I need to bump my major version number, and things like this. For sharing of ML artifacts (features, embeddings, models themselves) things are kind of nebulous. Let's dig on that a little bit.

Aditya?

28:30 - Aditya

So this is a very very long answer to a very very short question. The idea of sharing features is: One, necessary because the amount of ... we didn't actually cover cost-of-storage, cost-of-feature-generation when we were talking about cost-of-serving … that's actually huge. It’s massive. In fact ... the cost of feature engineering, as a process, is sixty percent of an ML engineer’s time. Even if you could reduce that by a little bit -- by sharing features that some other team has done -- that would be a massive amount of savings.

This is just from the perspective of wanting to share; I'm not saying it has to be completely open. There have to be safeguards in place. And this is where policy-style frameworks come into play. Are you actually supposed to be able to use this feature for training? X feature for Y kind of model? These are questions that we have to start answering.

I think as infra engineers we can build a framework but it’s up to business -- it’s up to applications to actually define what those policies need to be. We just need to make sure it is enforceable throughout our systems.

That’s my kind-of-short answer on how to share features. At least, that’s the philosophy we’re taking at Facebook.
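
As an illustration of the "can feature X be used for model type Y" check Aditya describes, here is a minimal sketch of a policy registry. The class names and policy fields are assumptions for illustration, not Facebook's actual framework.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeaturePolicy:
    feature_name: str
    allowed_model_types: frozenset  # e.g. {"ranking", "integrity"}
    allowed_uses: frozenset         # e.g. {"training", "inference"}

class PolicyRegistry:
    """Infra enforces the check everywhere; the policies themselves are
    defined by the owning application or business team."""

    def __init__(self):
        self._policies = {}

    def register(self, policy):
        self._policies[policy.feature_name] = policy

    def is_allowed(self, feature_name, model_type, use):
        policy = self._policies.get(feature_name)
        if policy is None:
            return False  # unregistered features are closed by default
        return model_type in policy.allowed_model_types and use in policy.allowed_uses

# Example: may the (hypothetical) feature "viewer_dwell_time" train a ranking model?
registry = PolicyRegistry()
registry.register(FeaturePolicy("viewer_dwell_time",
                                frozenset({"ranking"}),
                                frozenset({"training", "inference"})))
assert registry.is_allowed("viewer_dwell_time", "ranking", "training")
assert not registry.is_allowed("viewer_dwell_time", "integrity", "training")
```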

29:57 - Joel

Pranav, how are you thinking about sharing at Google?

30:01 - Pranav

This is a very important area that I believe will be one of the major factors driving the next phase of growth in AI. And the reason is: Today, when most engineers or scientists have to build AI models, they go and start from scratch. “Give me the raw data or give me the problem and I’ll build the model from scratch.” And this is not really sustainable when we have thousands of engineers across a company or across an industry building models.

And then you need to start componentizing out the models. So we need to break them into components that people across the organization can start using. "So I want this subset of components and I will use that." Some other team might have a different subset that they want to use. And that makes the work that's being done by one team highly usable by several other teams.

This is a very important area. When we talk about components, there are several different types: One is the data component itself, and even within data there are two parts. There are intermediate data deliverables within the models, things like embeddings, that could then be shared across models. And there are the end predictions that the model generates that could then become inputs to other models. The second category is code itself; code or models within models. And, even there, we are seeing a lot of reusability happening -- increasingly happening -- but not as much as is needed.

If we take the same analogy with something like C++ code or Python code, people are writing libraries all the time and then reusing other people’s libraries. I think that same phase of growth is going to happen within ML.

32:00 - Faisal

If I may add. A couple of things come to mind: One, when it comes to sharing features and the computation of features, often, it's a bit of a slippery slope. In the sense that the freshness of a feature, the context, the request context that went into computing a feature may be very pertinent to a particular use case designed at a certain point in time and you can't just arbitrarily pick something up, short of trivial features. That's one of the things that is obviously playing on people's minds.

We had a couple of approaches to address that. One of the things that we've done is we have a common platform library where we allow people to essentially promote features that they believe are in a good generalizable state. And then with that comes the responsibility of making sure that there's good documentation and reproducibility around exactly what is the intent behind this feature. The ability to promote -- but it's a conscious explicit decision as opposed to going to a discovery tool and finding something that you think looks right.

The other opportunity for sharing, which I think is what Pranav hinted at, is feature models. When you have a large ensemble, there are oftentimes fully developed models that essentially have a well-defined output, and their inference results can be used as an input score into, say, three different types of rankings. And you have a very similar set of models that could go into it as features, and the contract there is well-defined and clear in the sense that, however you created your dataset, you are presenting a prediction, and that's easily understandable and then feeds into other models.

So those are the two things that we've played with.

33:59 - Aditya

And that's actually really cool. We tried to do that on a smaller scale. The one thing -- we had a bunch of services whose quality-of-service for features, when it's not the infrastructure team who is the main authority for maintaining them, became an issue for us. So I'd love to hear a little about that.

34:20 - Faisal

It's certainly a challenge. In our case, because of our close partnership with the algorithm engineering team, they do maintain ownership of the models. We just provide the platform on which they can have that discoverability.

Extensibility

34:36 - Joel

We've touched on it a little bit: one of the things about building these platforms is that the industry is evolving extremely quickly -- new machine learning techniques are coming out every day, and new robust platforms are coming out on at least a quarterly basis. As you're thinking about your platforms, how do you keep extensibility? What does extensibility mean for you as you are designing? Andrew, can you talk a little on how you're thinking about it?

35:10 - Andrew

For extensibility, I think I might have mentioned a little bit before, but we really do keep our framework very agnostic to the different ML technologies out there, because we want to make sure that, if you decide to use a new ML framework, you can actually build it back into our infrastructure. And so we actually have what we call [missing word] libraries with which you can build wrappers around common ML frameworks. And you have to define your serialization and deserialization as well as their fit and transform. But, essentially, you can add in any ML framework you want.

We want to make sure that it's open so that if someone is very opinionated that there is an ML framework that works better, they can actually have the freedom to add it back into the system. And that also allows them to have their own Docker image which they can use, and really use that same image for deploying into production. And so, we make sure that we have this immense flexibility so that we can be adaptable to new technologies and new trends that emerge. We always try to stay on top of it to make sure that people have the freedom -- just because ML is constantly moving, is very dynamic -- and once the infrastructure becomes opinionated, a lot of our downstream users then have this friction, and if there's a lot of friction eventually they'll look towards maybe trying to find another infrastructure they can rely on. And so we want to be as flexible and supportive as we can to our downstream users.
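
For readers who want a picture of what such a wrapper contract can look like, here is a minimal sketch with serialization/deserialization plus fit and transform hooks. The interface and the scikit-learn example are illustrative assumptions, not Airbnb's actual library.

```python
import pickle
from abc import ABC, abstractmethod

class ModelWrapper(ABC):
    """Framework-agnostic contract: each wrapper defines how its model fits,
    transforms (scores), and serializes/deserializes."""

    @abstractmethod
    def fit(self, X, y): ...

    @abstractmethod
    def transform(self, X): ...

    @abstractmethod
    def serialize(self) -> bytes: ...

    @classmethod
    @abstractmethod
    def deserialize(cls, blob: bytes) -> "ModelWrapper": ...

class SklearnWrapper(ModelWrapper):
    # One concrete wrapper per framework; the platform only ever sees the interface.
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.estimator.fit(X, y)
        return self

    def transform(self, X):
        return self.estimator.predict(X)

    def serialize(self) -> bytes:
        return pickle.dumps(self.estimator)

    @classmethod
    def deserialize(cls, blob: bytes) -> "SklearnWrapper":
        return cls(pickle.loads(blob))
```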

36:45 - Joel

Pranav, can you talk a little bit about the Google perspective?

36:48 - Pranav

I think, as Andrew pointed out, extensibility is really important for infrastructure. One of the big reasons is because ML is in such a nascent stage. The use cases themselves are still evolving. If you build some infrastructure today for some use case that we know works today, two years from now the use case will have evolved very significantly. Users will have very different expectations. If I think about features like "smart reply", which we launched a few years ago, it was something back then users didn't expect and it was an add-on feature; but today, people are almost starting to take it for granted and they're expecting the next wave of features that would be driven by it.

"How do you build infrastructure that can anticipate the next few years of product scenarios and scale with those?" That's one part of extensibility. The second part of extensibility is “How do we scale with the number of users or the number of end applications?” And I think a good rule of thumb is most of the systems that we develop today should scale to at least two orders of magnitude growth. When it gets to third order then we probably need to rebuild or redesign the system; but, something that we are building today for some number of users, can it scale to two orders of growth? Both in terms of let's say end users or it could mean the internal number of teams that are using the infrastructure.

So I think those are the two ideas.

38:25 - Joel

Aditya, how are you thinking about that at Facebook?

38:30 - Aditya

So I think I heard a lot of things centered on making sure that we can use any of the frameworks that we need to. The one thing that we did slightly differently, which I alluded to earlier, is a glue layer on top -- something that we call Flume. The idea is to make sure that we have a declarative mechanism for being able to plug in whatever framework you want underneath. And we take care of doing the translation in between models. So you can actually create an ensemble model of PyTorch with Caffe2 with TensorFlow with whatever it is. The idea being that going from one layer to the next is something that we take care of.

It does run into challenges when we want to push this to inference. If it's Caffe2 or PyTorch, we take it to production easily; if it's something else it becomes a little harder. This has, with a lot of resources, made a bunch of teams more productive. For example, we have a bunch of integrity teams that can now build on each other's work. They can take a domain from, say -- the Field Integrity Team can now reuse something the Messenger Integrity Team tried. And especially with FAIR (Facebook AI Research) this becomes incredibly useful. PyText, which is the new text framework built on top of PyTorch, is becoming incredibly useful because we exposed it on Flume so a lot of integrity teams can use it. There is a bunch of value in being able to make this happen.

Our definitions of success

39:57 - Joel

Thanks! For the next round we're gonna go through each of the panelists and each of them will dig in a little bit on what success means for them. 

So Aditya?

40:11 - Aditya

What success means for us -- if I were to make it super simple: One, reliability. With distributed training, with new types of hardware, with different types of models, and the scale that we're dealing with, we need to make sure that that's top priority. The second is scalability. Distributed training and accelerated hardware are what we're looking at to make this possible. Third, developer productivity -- the idea that we actually need to make developers more productive -- and we measure this in various ways. I think reliability and scalability are fairly simple; developer productivity becomes very complicated, because the number of stages that we go through in a pipeline is phenomenally large. We actually have measurements for each of them.

For feature engineering we have measurements of how many features are shared and how quickly a feature goes from inception to production. With training, it is how quickly you get from the point where you started writing an ML pipeline to the point that you've finished -- it actually went and executed successfully. With inference, how quickly were you able to get it to inference, and did you end up failing? Inference, in my opinion, is the closest to standard web infrastructure that we see. It's the easiest to measure -- it's something that we understand. And for developer productivity as a whole -- how long did it take you to get from where you started until you ran an online experiment -- we still haven't figured out how to measure that part.

41:46 - Sandhya

So for me I think it's gonna be a balancing act across three things. The first one is around how you maintain agility; especially, like most of us talked about earlier, the space is changing very fast and as platform owners we have to stay ahead of the curve. How do you make sure that you're able to cater to the business needs, be it hardware provisioning or ensuring libraries are available, while keeping up the reliability of the service in general? I think that's the first one.

The second one, for me, is again around innovation. As we need to keep up with the changing technology space, a lot of the ways we used to do things traditionally won't be able to keep up with that rate. How can we quickly innovate? What are the sets of tools we apply, or the technologies that are out there? It could be around alert correlation, being able to automate recovery to a large extent without human interaction, and so on and so forth -- how we do things also needs to change and we need to innovate.

And the last one is cost efficiency. Again, while we need to cater to the business, there is also a huge supply chain and vendor management aspect, and keeping up with the physical realities of datacenter space. And learning "How do I balance all three of these things?" is how I would define success for our team.

43:28 - Faisal

For us, a lot of things are similar. Obviously. It sounds like a broken record here.

But the one key thing that perhaps comes out slightly differently for us, and it's in the nature of where our team sits, is: Researcher productivity trumps everything else. Obviously resources aren't infinite -- I mentioned that -- but our most important metric is how quickly and how many A/B tests we are able to enable on our platform; how easy it is for a new researcher to come in, really understand our custom platform, and be able to run their first A/B test. So all of those things are sort of the guiding principle through which we measure our success.

Of course, our very close customers measure success in terms of how many product experiences they're able to push through -- which is an important element for us too -- but, more important than the actual product wins are "How much innovation have you enabled on the platform?" and "How fast are people able to try out new things in the Netflix product?". That's how we measure ourselves.

44:38 - Joel

Andrew?

44:39 - Andrew

From a PM perspective, we tend to view it through some of these business metrics -- which, working on infrastructure, is always a little bit difficult to quantify. Coming from the cloud world, I know that a lot of cloud services do it straight on revenue or profits. And so, moving into more of an infrastructure team where it's harder to get to a business impact just because we're a couple of layers deep, we really break it down into the number of users for the breadth approach, to make sure that we have a good number of users and teams within the company using our infrastructure. And then for the depth approach, we look at the number of models per user and team, as well as QPS, which is just how many inferences and how many scores are being generated, whether online or offline, just to see whether we can increase the usage within our infrastructure as well. And then we also track a lot of the infrastructure metrics: availability, scalability, latency, costs.

And then, we also tend to put a high importance on long-term decision making. This one's really hard to quantify, but it's something that I think is fundamental to infrastructure. The fact is that one poor decision or a short-sighted decision could really lead to a lot of cost for the team. And it's always difficult to quantify that. It's always hard to justify it as well, because you're preventing an accident from happening. And it's in that pros-and-cons kind of matrix, trying to figure out whether it's worth it or not; we tend to heavily emphasize that it is worth it because with infrastructure generally, as more use cases onboard hundreds of models, it's hard to iterate. It's hard to change things. And so we want to make sure that we make the right decisions up front, and it's something that doesn't -- as a PM, you know, I struggle with this -- it doesn't quantify to a number that I can track. But it is incredibly important to make sure that we don't end up on fire, where we can't really solve all the technical debt we've accumulated and we made a lot of poor decisions to get to that point. Usually that results in building a v2 of the infra you just had and trying to scrap the old one, and it's a whole big effort, so overall we do put a lot of points on that.

47:01 - Pranav

So everything I could have said has been covered. [laughter].

The one slightly different angle I would add, which I think Sandhya already did add, is the innovation aspect. How much can the infrastructure that we are building today power the use cases of tomorrow? If there's one thing we know about ML, it is that in the next five years the use cases are going to look very different from the last five years. How can the ML infrastructure systems that we are building today power those future use cases? That is one lens with which I look at my systems: are we really powering these new, novel scenarios? Of course, yes, we can do the previous ones better, faster, more reliably; but can we power really new ones which a year ago most people couldn't even have thought of?

47:52 - Joel

Thank You!

Audience Questions

So now it is your turn to dig in to us -- so question time!

48:15 - Audience Member

All of you are working on large-scale ML. In this last round, a bunch of you talked about developer productivity. So could you dig into that? What are the factors that impact developer productivity? Is it the kind of frameworks you use, the languages you use, the data you use? Could you shed some light on each of your own views? Because finally it is about company success -- it is about how quickly you can move that model back into production so that it gives you better results, better advertising revenue, better recommendations.

49:00 - Faisal

I'm gonna jump in. So, it runs the whole gamut when it comes to productivity. Right from the start, when you are doing very early ad hoc explorations: whether it's the right notebook environment where you want to see the visualizations of the data together; how easy it is to take an early-inception feature into something that's more productized and ready to be launched into a production pipeline. And then, how much time are you spending on understanding or dealing with the challenges of the platform versus the place where you should be spending your time? How easy does the platform make it for you to debug your problems?

So, it is a whole dimension -- much of which is not new, as software engineering in general has been dealing with these things. The additional layer that ML adds is that suddenly data has been elevated to a whole new level. Garbage in will result in a perfectly wonderful software engineering system leading to a garbage-out result. And that's the key thing: data provenance, data quality -- how can the platform help with all of those things -- are now an additional element that research scientists have to think about. So, there's a whole new set of problems, but also a lot of common problems that we've been dealing with for years.

50:19 - Joel

One of the challenges, one of the dimensions that we've seen is the artifacts, the systems that get built with ML are so complicated that, for a long time, only the ML engineers could troubleshoot them. So, if there was a bad job recommendation that someone complains to execs [about] and it flows down, it's only the ML team that can work on it. And while they're working on troubleshooting that problem they're not innovating. They're not driving new features. They're not building new capabilities.

So, one of the things that we've been looking at is “Can we make the systems more understandable? Can we make it so that product managers, so that application engineers, can troubleshoot and at least get the first one, two, or three layers of troubleshooting before it brings in the ML engineers."

51:16 - Aditya

So we faced the problem that Joel is talking about, the problem that Faisal is talking about, and I’ll try and give you numbers. So, in our case, distributed training, at its maximum, can last up to 28 days on several machines. And it's 28 days only because we stop it at that point. The other problem is that within those 28 days, if one machine fails for whatever reason, your entire training stops -- you gotta start over.

So we had to come up with mechanisms to counter that. We have to come up with mechanisms to make sure that, if it fails, it fails within the first thirty minutes if possible, within the first day at worst, because that is what engineers want to know -- they don't want to wait 28 days to see whether something's going to fail. And, ideally, we should be the ones telling them -- the infrastructure provider should be the one telling them, not them finding out on their own.

The reason for this is actually really simple: We’ve all invested in making these people more productive. When you make somebody more productive they're going to do more. They're gonna run more experiments. They're gonna train with more data. They can try out more features. Leading to a problem where anything failing, anything not working as they expected, can have larger consequences. Which is why we invest a lot more in making them productive but also providing the guardrails of telling them when something happens.
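
A sketch of the fail-fast idea Aditya describes: run a tiny slice of the real job up front so configuration, data, and model errors surface in minutes rather than days. The `pipeline` object and its methods here are hypothetical placeholders, not Facebook's APIs.

```python
import time

def preflight_check(pipeline, sample_batches, time_budget_s=1800):
    """Fail within the first ~30 minutes instead of 28 days by exercising the
    real pipeline on a handful of real batches before the long run starts."""
    start = time.monotonic()

    pipeline.validate_config()             # cheap static checks first
    for batch in sample_batches:
        pipeline.check_schema(batch)       # feature names, types, nulls
        loss = pipeline.train_step(batch)  # one real optimizer step
        if loss != loss:                   # NaN check: NaN != NaN
            raise RuntimeError("Loss went NaN on a sample batch")
        if time.monotonic() - start > time_budget_s:
            raise TimeoutError("Preflight exceeded its time budget")
    return True
```

In practice, a preflight like this is usually paired with periodic checkpointing, so a single machine failure resumes from the last checkpoint rather than restarting a multi-week job.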

52:47 - Sandhya

If I may add one more thing.

If you go down the stack, there are obviously thousands of machines supporting everything we do. For example, aging hardware -- we do need to take care of that. We are working on a project to replace about 4,000 machines over a short period of time without impacting, like what Aditya said, whatever is happening. Flexible infrastructure upgrades, like OS upgrades -- those are all mundane tasks when you do them at the machine level, but when you have to do this at massive scale on live clusters, it is a very hard, complex problem to solve. And I think that's where the balance is now: how we can be agile and still be able to cater to the business needs.

53:29 - Joel

Do we have another question?

53:38 - Audience Member

Those are really good, panel. Maybe this question is a little too specific, but still, I want to hear your ideas on it.

We know that when things are being deployed in today's operational world and today's data centers, at least most companies and most applications are now aware that they should be instrumenting their applications to a large extent to get good metrics out, so that they can then analyze them, come back, and change the software accordingly. How does having machine learning models, machine learning applications, impact the monitoring as such? In the sense that, are there new things that we're supposed to monitor? Are we still using the old artifacts -- the CPU, disk, memory -- is that all? Are there other things that we should be looking for when we are trying to assess the quality of predictions?

54:30 - Sandhya

So I think we definitely have gotten better at looking at the machine stats, but I think we are at a point where we have to traverse the call graph. For example, a call to the namenode has to talk to multiple services, like LDAP. It needs to make a call through the network, it needs to make a call to DNS. I think traversing the end-to-end graph -- and though we have a ton of metrics, we do need to apply metrics correlation, because I think it's impossible for humans to look through, traverse through, dashboards and pinpoint where the problem is. That is the space we are dealing with: though all the applications have gotten better at spewing metrics, making sense of which are the most critical ones, and how we build it up so there's not a whole lot of noise and you can exactly pinpoint and get to it, I think, is the hard thing.

At LinkedIn, we have this application ThirdEye -- I believe it is open source now -- where we leverage it at the business-metrics level to identify the noise as to why there are variances. It could be holidays, it could be some other new models that were rolled out, a new LiX ramp -- an experimentation ramp that went in. We leverage platforms like ThirdEye to help us determine that rather than humans doing it. It's beyond humans at this point.

55:56 - Joel

One of the things that's fundamentally different, I think, in ML is: we have this idea of these features, and when we're in the inference world, many of these features are extremely ephemeral. And if I'm remembering right, Netflix had this idea of "Capture the facts, not just the features." Going back to that, do you want to quickly touch on it?

56:18 - Faisal

I can, but before I do, let me just answer one part that I think you were asking, which is about the new metrics that are coming in the world of ML. One key thing that's different from traditional software engineering monitoring is around the distributions of the data and the distributions of the features -- feature importance that you're tracking as inferencing is happening live. Sometimes data skews can be a whole interesting challenge to deal with, because if your data is skewed in a way you weren't expecting, your model metrics will be off substantially. Having that kind of monitoring is important.

And to the point that Joel was bringing up about facts versus features: One of the things that we have found to be a lot easier to share are what we call "facts" -- think of facts as almost like ground truth or some incontrovertible happening. So for instance, somebody came to the Netflix website and they were impressed by a particular video, and then they added that to their My List, and they clicked play -- things of that nature. Those are far easier to share because they're incontrovertible, and oftentimes they can save a lot of the effort that goes into developing a feature from scratch. So I think that's the point that you were making.
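
To make the distribution-monitoring point concrete, here is a minimal sketch using the population stability index to compare a feature's serving-time values against its training-time values. The metric choice and the 0.2 threshold are common conventions assumed for illustration, not Netflix's actual monitoring.

```python
import numpy as np

def population_stability_index(expected, observed, bins=10):
    """Compare a feature's live (observed) values against its training-time
    (expected) values. Larger PSI means more drift/skew."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    exp_counts, _ = np.histogram(expected, bins=edges)
    obs_counts, _ = np.histogram(observed, bins=edges)
    # Turn counts into proportions, with a small floor to avoid log(0).
    exp_p = np.clip(exp_counts / max(len(expected), 1), 1e-6, None)
    obs_p = np.clip(obs_counts / max(len(observed), 1), 1e-6, None)
    return float(np.sum((obs_p - exp_p) * np.log(obs_p / exp_p)))

def is_skewed(expected, observed, threshold=0.2):
    # Rule of thumb (an assumption, tune per feature): PSI above 0.2 warrants a look.
    return population_stability_index(expected, observed) > threshold
```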

Final Takeaways

57:44 - Joel

So we're gonna wind up. We've got one last little bit. Each of us is going to move through and leave you one key takeaway.

57:58 - Pranav

So I think the one takeaway I would give is: Adapt and evolve your ML infrastructure as the research world and product scenarios evolve.

58:10 - Andrew

Something else I think would be: Invest in the infrastructure side. ML infrastructure relies on a lot of pieces to go right for you to have really stable, scalable infrastructure. Whether it relies on the data warehouse, whether it relies on other components underneath, and so what I find is, to make an ML infrastructure really successful, it really does take building that core infrastructure piece, and all the dependencies underneath, to be really stable so that you can build a lot of really cool use cases on top of it.

58:51 - Faisal

So full disclosure: Joel had given us a five-word limit on the takeaway, so I thought really hard about what I'm gonna say! My five words are:

Researcher time over machine time! [clapping and laughing]

59:07 - Sandhya

Awesome! That’s a good one.

So for me -- I've got it -- it's a five-word sentence:

What gets measured gets fixed.

That's an ideal that we all try and strive towards.

59:19 - Aditya

I think that Andrew kind of mentioned this:

Build on stable infrastructure, because after you're done with that, you actually have to focus on the higher level pieces: what inference does, how to make people more productive. And having good stable infrastructure will help you with that.

59:36 - Joel

Thank you! It's been a real pleasure working with all of you!

I know that there were hands up, as the friction came down, to ask questions, and we weren't able to get to all of you. There are at least two massive social networks that are really valuable for reaching out to folks. I bet you can figure out how to find any of us if you have more questions to dig in on. Plus, we'll be around.

So thank you all.

60:02 - Nisha

Thank you very much. Just one really quick thing: another way to stay connected after the conference is the Slack channel, so please go ahead and join that, and I would like to ask the panelists to join as well.

Thank you, everyone.

Final Thoughts

Have you experienced similar challenges? Consider sharing your experiences and your questions in the comments below!

[Photo credit: R. Hoe & Co.'s sextuple stereotype perfecting printing press, with folders. (British Museum, HMNTS 10413.dd.1.)]

Comments

Matthew Russell (Training for the Zombie Apocalypse):

You continue to be an inspiration in terms of connecting theory and practice! Your mentorship to me as a CS grad student was instrumental in shaping my emerging career in ML, and I continue to benefit from it and to be grateful for it!

Joel Young (ML Infrastructure | Gen AI, Leadership):

By the way, the transcript is not grammatically correct -- other than removing verbal tics, it is how folks talk while simultaneously thinking about difficult things. If you want to get a sense of the gap between the state of the art for automatic transcription and a human, compare this transcript to the closed captioning in the video.

Joel Young (ML Infrastructure | Gen AI, Leadership):

Finally got the video and transcript set for our panel discussion at USENIX OpML '19. Huge thanks to Nisha Talagala, our chair and host. To my fellow panelists Faisal (Netflix), Andrew (Airbnb), Sandhya (LinkedIn), Pranav (Google), and Aditya (Facebook): this was very interesting and I learned a ton putting this together -- it is a small community and I look forward to working with all of you! #mlinfra. Great job to the USENIX Association for putting together this inaugural conference! #opml19 https://www.usenix.org/conference/opml19 . Finally, we wouldn't have this if Yazhou C. hadn't recorded it for us!
