What we can learn from a crash

Adrian Otto

Technical Director, Office of the CTO, Google

Published: April 4, 2022

Today Google officially returned to the office. We have 156,500 employees, so having so many of us show up again on the same day meant that we had to ramp up our services here at the office. I work in one of our San Francisco Bay Area locations, so it's a nice perk to have lunch served in a cafe here on Google's campus. I use an internal Android app called "Eat" to browse the cafe menu and check for various ingredients and food allergens. It's terrific. But today, it was down when I checked it at lunchtime.

Now, before I say anything about why it didn't work, I want to be perfectly clear: I did not fully investigate all the reasons for the trouble, and our postmortem work is not yet complete. I've used the app for nearly five years, and I've never had any problems with it at all until today. It's clearly not a mission-critical application for Google, and as a result, it has not been allocated the same resources and rigor that an externally facing system gets. We don't expect it to be as reliable as our consumer services.

Our "Eat" application, like most apps... even many mission-critical ones... is missing a design feature known as graceful degradation. This is one of the key tools used to prevent cascading failures, and is explained in a chapter of our SRE book that you can read online for free. It was very clear that this capability was missing because when I opened the app, it could not even draw its initial screen. There was no error message. Just a cute green spinning animation in the middle of an 80% white screen indicating that something was loading, but it never actually loaded.

That's no big deal. There is still food served in the cafe, and there are paper menus posted with what allergens are in the food. I had a salad and sandwich for lunch today, and it was delicious. While I was enjoying it, I was reminded of how many times I've seen problems like this in software systems. Graceful Degradation is actually something that is very rarely employed. Why? Well, because usually software engineering teams don't view it as a priority until we have some kind of a technical disaster. You'll see these features in top services that billions of people use. In your average mobile application? Fat chance.

All large complex systems experience failures. Yes, all of them. Even most simple systems at some point will have some kind of failure, just like my trusty "Eat" app. Simply expecting them to work 100% of the time is unreasonable. Things happen. Is adding Graceful Degradation a big deal? Not usually, but you need to at least spend some time thinking about it, and prepare for dealing with it.

Let's take the example of an overloaded system. Suppose "Eat" went down because of the larger number of people who showed up to our offices today (among other exacerbating factors). What might have helped? Follow along, and in your mind think about those apps that you care about and are responsible for, and how you could improve them from this lesson:

Use short timeouts. When your app requests something from a service over a network, limit the time you'll await a response, and make that timeout relatively short. My rule of thumb is roughly 3X the time it normally takes to complete the task, with a sensible lower floor, especially if you are going to automatically retry.
Use automatic retries with exponential backoff. If you get an error on the first couple of attempts, automatically retry the request, while increasing the delay between requests each time on a simple exponential scale. Slightly randomize the delay so client retries are more evenly spread out over time, rather than synchronized on a single failure event. Limit the total number of attempts.
Quick errors are better than slow failures. If the response does not return after maybe two or three retries, it's not likely to happen at all. If you don't have a degraded mode to enter, go ahead and return an error to the user encouraging them to try again a bit later.
Automatically enter a degraded mode. If your client is getting timeouts from its primary system, can you render a more limited user experience using an alternate system? For example, maybe my "Eat" app could give me a list of web links to the Google Docs that were used to print the daily menus displayed in the various cafes, sorted by location. Clicking one of the links could open that document in my web browser on my phone. Google Docs has no dependency on the APIs used to run "Eat", so the chance of both those systems failing at the same time is much lower. You don't need to run a query over the network to know what the link to the menu will be.
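
The four ideas above can be sketched together in a few dozen lines. This is only an illustration in Python, not how "Eat" is actually built: the function names, the default values (a 0.5 second floor, a 0.2 second base delay, 25% jitter, three attempts), the cafe names, and the example.com links are all assumptions made up for the example. The `fetch` argument stands in for whatever network call your client makes, and is assumed to accept a timeout in seconds and raise `TimeoutError` or `ConnectionError` on failure.

```python
import random
import time


class ServiceUnavailable(Exception):
    """Raised when every attempt fails and there is no degraded mode."""


def timeout_for(typical_seconds, floor_seconds=0.5):
    # Rule of thumb: roughly 3x the normal completion time, with a lower floor.
    return max(3 * typical_seconds, floor_seconds)


def backoff_delays(max_attempts, base_delay=0.2, jitter=0.25):
    # One delay per retry: doubles each time, then is randomized by +/- jitter
    # so that many clients do not retry in lockstep after the same failure.
    for retry in range(max_attempts - 1):
        delay = base_delay * (2 ** retry)
        yield delay * (1 + random.uniform(-jitter, jitter))


def fetch_with_fallback(fetch, typical_seconds, fallback=None,
                        max_attempts=3, sleep=time.sleep):
    # Try the primary system a bounded number of times with a short timeout.
    timeout = timeout_for(typical_seconds)
    delays = backoff_delays(max_attempts)
    for attempt in range(max_attempts):
        try:
            return fetch(timeout)
        except (TimeoutError, ConnectionError):
            if attempt < max_attempts - 1:
                sleep(next(delays))
    # The primary system is not answering: degrade if we can, else fail fast.
    if fallback is not None:
        return fallback()
    raise ServiceUnavailable("The menu is unavailable. Please try again later.")


# Hypothetical links shipped inside the app, so no network query is needed
# to know where each cafe's menu document lives.
STATIC_MENU_LINKS = {
    "Cafe B": "https://docs.example.com/menu-cafe-b",
    "Cafe A": "https://docs.example.com/menu-cafe-a",
}


def static_menu_links():
    # Degraded mode: a plain list of (location, link) pairs sorted by location.
    return sorted(STATIC_MENU_LINKS.items())
```

A caller would pass its real request function as `fetch`, for example `fetch_with_fallback(load_menu, typical_seconds=0.4, fallback=static_menu_links)`, where `load_menu` is a hypothetical function that performs the request. If the primary service keeps timing out, the user sees the static links after three short attempts instead of an endless spinner.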

Chances are that we will take the time to make changes like this now. Will it take a significant amount of software engineering time or system resources to implement them? I doubt it. I'd bet this could be done in a few hours of work, and maybe a couple of days of testing. Suppose you took the time now to make this set of changes to your app. Might this help you out on that inevitable occasion when your systems don't work as planned? Try it and see.

Follow Adrian Otto on LinkedIn and Twitter.

