Support Traps — A cautionary tale for infrastructure engineers
[Header image: Winding Gear With Independent Drums, Stevens and Corbett, 1890. British Museum, HMNTS 7104.dd.3]

BLUF: Avoid the support trap, a kind of success trap that many platform engineering teams experience.

In 2016, I started managing LinkedIn’s search federation platform. The platform provides a unified interface to the many topic-specific search engines, such as those for articles, jobs, and people. Each engine has its own workflow to rewrite queries, plus there is a master workflow that blends all of the topic-specific results. The system worked well: if we needed to add a new class of documents, we only needed to build the query rewriter and add it to the blending workflow. To avoid having to build and deploy multiple services, all the workflows ran together in the same service.
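To make the shape of the system concrete, here is a minimal sketch of that design. The interface and class names are hypothetical illustrations, not LinkedIn’s actual code.

```java
// Hypothetical sketch of a federated search service: one query-rewriting
// workflow per vertical plus a blending step, all hosted in a single service.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

interface QueryRewriter {
    String vertical();                 // e.g. "people", "jobs", "articles"
    String rewrite(String rawQuery);   // vertical-specific query rewriting
}

interface SearchEngine {
    List<String> search(String query); // topic-specific engine, e.g. jobs search
}

class FederationService {
    private final List<QueryRewriter> rewriters;

    FederationService(List<QueryRewriter> rewriters) {
        this.rewriters = rewriters;
    }

    // Fan the rewritten query out to each vertical's engine, then blend.
    List<String> search(String rawQuery, Map<String, SearchEngine> engines) {
        List<List<String>> perVertical = new ArrayList<>();
        for (QueryRewriter r : rewriters) {
            String rewritten = r.rewrite(rawQuery);
            perVertical.add(engines.get(r.vertical()).search(rewritten));
        }
        return blend(perVertical);
    }

    // Master workflow: interleave a few results from each vertical (placeholder logic).
    private List<String> blend(List<List<String>> perVertical) {
        List<String> blended = new ArrayList<>();
        for (List<String> results : perVertical) {
            blended.addAll(results.subList(0, Math.min(3, results.size())));
        }
        return blended;
    }
}
```

Onboarding a new vertical then means writing one more QueryRewriter and registering it with the blending workflow, which is exactly why adding customers felt so cheap.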

Fast-forward several months and it’s a few weeks before the end of the quarter (crunch time). Two customer teams dropped some large and complicated workflow revisions for code review. There was extensive back and forth on the reviews, but we bent over backwards and got the updates cleaned up and deployed. In the process, we slipped some of our strategic work into the next quarter. Things were looking good for a few weeks, but then all hell broke loose when the Java-based service went into a massive garbage-collection death spiral. This was our first major site issue in over a year. The post mortem revealed that a UDF had been added to the query-rewriting workflow that did a key-value store (KVS) lookup and cached the result for performance, but the cache TTL was just long enough to push the cached entries into the old generation.
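The failure mode is easy to reproduce in miniature. The sketch below is a hypothetical reconstruction of the pattern rather than the actual UDF: values cached with a multi-minute TTL survive many young-generation collections under steady query traffic, get promoted into the old generation, and the old generation slowly fills until full GCs dominate.

```java
// Hypothetical reconstruction of the problematic pattern: a UDF that caches
// key-value-store lookups with a TTL long enough for entries to be promoted
// into the old generation.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class CachedKvsUdf {
    private static final long TTL_MILLIS = 10 * 60 * 1000; // 10 minutes

    private record Entry(String value, long insertedAt) {}

    // Static, unbounded cache shared across all queries in the service.
    private static final Map<String, Entry> CACHE = new ConcurrentHashMap<>();

    String evaluate(String key, KvsClient kvs) {
        long now = System.currentTimeMillis();
        Entry cached = CACHE.get(key);
        if (cached != null && now - cached.insertedAt() < TTL_MILLIS) {
            return cached.value();          // fast path: served from cache
        }
        String value = kvs.lookup(key);     // remote key-value-store call
        // Entries live for minutes under heavy query traffic, so they survive
        // several young-gen collections and are promoted to the old generation.
        CACHE.put(key, new Entry(value, now));
        return value;
    }

    interface KvsClient {
        String lookup(String key);
    }
}
```

The unbounded map plus the long TTL is the combination that hurts; a bounded cache, or a TTL short enough that entries die in the young generation, would likely have avoided the promotion pressure.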

To limit the impact of future faults, we started designing a system to provide tenant isolation between the workflows. Unfortunately, our support load kept accelerating. It turned out that, for the past year, the company had been focused on a complete front-end rework rather than on new product features; that work was now done, and the requirements landing on the backend were growing in both number and complexity. Furthermore, given our previous site issue, we were especially careful with reviews. As a result, changes were taking multiple weeks to ship.

Although the tenant isolation work would have allowed us to unblock our customers, we didn’t have time to build it. What had been a great platform was becoming a boat anchor with grumpy customers and stressed-out engineers.

How many of us have had similar experiences? If you build horizontal infrastructure for a good-sized company, I bet you have. I’ve discussed this with engineering managers from several of the valley’s large internet players, and every single one of them had similar experiences. I personally have experienced it a couple of times — and yet I had never recognized that it was a thing. A thing I call the Support Trap.

Support Trap

A support trap is a situation in which the work to support customers starves out all other work. It is a risk whenever a team builds a key horizontal platform or capability for customers, internal or external, to build their products on. For products, one can measure business metrics such as revenue and user engagement. However, for platforms like the search federation system introduced above, it isn’t so easy. Rather than capturing business value, success is often measured by the number of user-facing products leveraging the platform, the transaction rate (e.g., queries per second), or the cost per transaction. This leads to a perverse situation where there is a huge reward for onboarding customers quickly and at scale: build the minimum viable product and move on to the next target.

The trap comes when the effort to maintain the platform and meet the needs of its customers consumes all available resources, leaving no engineers free to work the team out of the trap. Below we’ll discuss some strategies for getting out of and avoiding the support trap. But first we need to understand that the support trap is just a particular form of success trap.

Success Traps

I was talking with a manager on a core-systems team who had gotten similarly stuck and was getting bad reviews. At about the same time, I read a blog post by Brendan Reid and had a personal epiphany: we were getting stuck in a particular kind of success trap. What’s that, you ask? One blogger described it as

The success trap is like living in an invisible box. You can’t see the edges, but you’re still bound by unwritten rules.

There are many types of these traps and many classic examples, perhaps the most famous of which is Eastman Kodak. In 1975, Steven Sasson built the first digital camera at Kodak. He had tried to get company leadership interested, but they didn’t bite. He described their response:

“They were convinced that no one would ever want to look at their pictures on a television set,” he said. “Print had been with us for over 100 years, no one was complaining about prints, they were very inexpensive, and so why would anyone want to look at their picture on a television set?”

Despite dominating the photographic film industry, Kodak began struggling at the end of the twentieth century. They tried to pivot into digital photography and printing, but it was too late, and they entered bankruptcy in 2012.

What is the fundamental pattern? The essence is overweighting exploitation in the explore-exploit tradeoff. The individual or organization had an initial series of successes and learned a formula for success. They then shifted their resources to leveraging (exploiting) this formula, to the exclusion of continuing to explore their space and learn how to deepen their success. Using a machine learning metaphor, one might consider it a form of overfitting, or of getting stuck in a local optimum. We saw this with Kodak exploiting its position in the film industry, and we saw it with my team onboarding customers onto our federator without considering the long-term consequences.
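For readers who want the machine learning metaphor made concrete, here is a toy epsilon-greedy bandit sketch (purely illustrative and not from the original article): with epsilon set to zero, the agent exploits whichever arm looked best early on and never explores enough to discover a better one, which is the local optimum the text describes.

```java
// Toy epsilon-greedy multi-armed bandit illustrating the explore-exploit
// tradeoff: epsilon = 0 is pure exploitation and can lock onto a worse arm.
import java.util.Random;

class EpsilonGreedyBandit {
    private final double[] estimates;  // running average reward per arm
    private final int[] pulls;         // number of times each arm was pulled
    private final double epsilon;      // probability of exploring a random arm
    private final Random rng = new Random(42);

    EpsilonGreedyBandit(int arms, double epsilon) {
        this.estimates = new double[arms];
        this.pulls = new int[arms];
        this.epsilon = epsilon;
    }

    int chooseArm() {
        if (rng.nextDouble() < epsilon) {
            return rng.nextInt(estimates.length);   // explore
        }
        int best = 0;                               // exploit current belief
        for (int i = 1; i < estimates.length; i++) {
            if (estimates[i] > estimates[best]) best = i;
        }
        return best;
    }

    void update(int arm, double reward) {
        pulls[arm]++;
        estimates[arm] += (reward - estimates[arm]) / pulls[arm];
    }
}
```

Keeping epsilon above zero is the algorithmic analogue of protecting a slice of the team’s time for exploration.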

In a must-read 1993 classic, Levinthal and March detail a large taxonomy of success traps. These include traps of distinctive competence, overlooking distant times, and traps of power. Support traps are a blend of the first two with sometimes a touch of the third.

First, the team builds a particular competence in describing and solving customer problems with their platform. They learn that by exploiting this competence they can bring quick rewards both to their customers and to themselves. In and of itself, this is fine. Second, there is a tendency not to explore distant, second-order effects. Combining the two leads to the success trap. Complicating things, there is often a dynamic pressuring customers to adopt the new platform as early as possible. Sometimes this is simply an economic decision, but sometimes it is organizational, with a mandate from leadership to migrate to the new platform, perhaps as part of a vertical-to-horizontal restructuring.

Escaping the Trap

Once a team gets stuck in a support trap, it is challenging to get out. And the longer one goes before identifying the problem, the harder it is to escape. There is only one way out: reduce the portion of the team’s time spent supporting customers. We can do this by dropping customers, by reducing the per-customer support cost, or by adding engineers. None of these are free.

I have seen dropping or firing customers work out. In fact, I was one of the fired customers. LinkedIn had a great new distributed NoSQL database, Espresso, that among other things was to replace our legacy Voldemort key-value store. In 2014, we had migrated about half of our use cases to the Espresso store when we were asked to reverse the migration: the Espresso team had found itself in a support trap and needed to focus its efforts on a different part of the solution space. This was painful and was short-term hard on the service’s reputation, but in the long run it was the right answer — Espresso is now the core key-value store platform at LinkedIn. As LinkedIn’s open-source Pinot platform matured, it turned out to be a better fit for our use case. Sometimes the customers who are painful to support shouldn’t be supported.

The next approach is to reduce the per-customer support cost. Unless one takes the blitzscaling strategy of ignoring customers, this requires engineering, documentation, process, and training efforts — none of which are free — and once in the trap, all of the team’s energy is already being spent on support. The team needs to find additional engineers, either as permanent headcount or as loans, perhaps from the customers themselves. However, if it took too long to identify the support trap, the customers are already getting bad service, and they and leadership will be reluctant to throw good money after bad. To make enough headroom, one may need to drop some customers, ignore others, and add engineers to the project. Escaping the trap at this point may require dramatic efforts, including reorganization.

As noted above, this happened to me. I had three different platforms, together hosting over fifty product workflows. We were maintaining our strategic efforts until our product teams completed the new LinkedIn desktop experience, and suddenly the velocity of their changes skyrocketed. We had an engineering plan in flight to make it much easier to support customers, but it was completely starved out. The backlog grew so large that it was taking us three weeks to review even small changes, causing several teams to miss their quarterly goals.

The solution was to completely refactor the team, with each of the platforms moving to a more vertically aligned organization while also adding engineers. For example, the search federation platform moved to our search infrastructure organization. You can read more about how they are reducing the support cost for the service in their blog post, including moving to a new workflow engine and building isolation between the vertical use cases.

Avoiding the Support Trap

The key to avoiding all success traps is to keep sufficient energy in exploration while looking forward with a long enough horizon. In horizontal teams, we often try to do this by ensuring ongoing investment in long-term strategic bets — and not letting short-term priorities starve them out. This isn’t easy: when short-horizon opportunities come up, it is all too tempting to tap into those resources. One way to frame this is to start thinking of your system as a Platform as a Service (PaaS) that needs to scale not only with load but also with the number of customers.

A side benefit of the PaaS approach is that you might be able to charge for the service — and if the payment can be converted into additional engineers, it directly helps keep the trap at bay. Another approach is to build a solutions team to help customers adopt and use the platform; that cost can be amortized across multiple platforms.

In writing this note, I hope to help other engineering managers and tech leads avoid mistakes like mine. As I learn more, maybe we'll see a sequel with more details on avoiding the trap.

Have you experienced similar challenges? Please consider sharing your experiences and how you worked through them in the comments below!

Bef Ayenew

Technology Leader

5y

Great observation, Joel. On Identity, the ISB team built WiFi (a self-service platform for doing data fixes and migrations) to get around one of these support traps.

陈敏

Ph.D., California Institute of Technology

5y

Great observation, I really like your explore-exploit analogy for the infra team dilemma.

Sachin K.

India Site Lead Driving Innovation in Privacy, Safety & Security | Building a Trusted Digital Ecosystem

6y

Great article, Joel! Nice use of ML metaphors, explore-exploit and overfitting. We've also seen some parts of this journey with the Unified Content Filtering (UCF) platform. We were able to resource two fast-serve projects to stay out of these traps. There was a time in ~2016 when it took about a quarter to onboard a new client; we got this down to a few weeks in ~2017 with better APIs, process, docs, and FAQs; and now it is being reduced to a couple of hours by representing code as data (with auto-generated code) configured via a new UI. The big jumps, quarters to weeks to hours, helped us stay ahead of the incoming rate of clients.
