What we can learn from a crash
Crashed cars look like this, but crashed software... looks like software errors.

Today Google officially returned to the office. We have 156,500 employees, so having many of us show up again on the same day meant that we had to ramp up our office services. I work in one of our San Francisco Bay Area locations, where a nice perk is lunch served in a cafe on Google's campus. I use an internal Android app called "Eat" to browse the cafe menu and check ingredients and food allergens. It's terrific. But today, it was down when I checked it at lunchtime.

Now, before I say anything about why it didn't work, I want to be perfectly clear: I did not fully investigate all the reasons for the trouble, and our postmortem work is not yet complete. I've used the app for nearly five years, and I've never had any problems with it until today. It's clearly not a mission-critical application for Google, and as a result, it has not been given the same resources and rigor as an externally facing system. We don't expect it to be as reliable as our consumer services.

Our "Eat" application, like most apps, even many mission-critical ones, is missing a design feature known as graceful degradation. This is one of the key tools used to prevent cascading failures, and it is explained in a chapter of our SRE book that you can read online for free. It was very clear that this capability is missing, because when I opened the app it could not even draw its initial screen. There was no error message. There was just a cute green spinning animation in the middle of an 80% white screen, indicating that something was loading, but nothing ever loaded.

That's no big deal. There is still food served in the cafe, and there are paper menus posted that list the allergens in the food. I had a salad and sandwich for lunch today, and it was delicious. While I was enjoying it, I was reminded of how many times I've seen problems like this in software systems. Graceful degradation is very rarely employed. Why? Because software engineering teams usually don't view it as a priority until they have some kind of technical disaster. You'll see these features in top services that billions of people use. In your average mobile application? Fat chance.

All large complex systems experience failures. Yes, all of them. Even most simple systems will fail at some point, just like my trusty "Eat" app. Expecting them to work 100% of the time is unreasonable. Things happen. Is adding graceful degradation a big deal? Not usually, but you need to at least spend some time thinking about failure and preparing to deal with it.

Let's take the example of an overloaded system. Suppose "Eat" went down because of the larger number of people who showed up to our offices today (among other exacerbating factors). What might have helped? Follow along, and in your mind think about those apps that you care about and are responsible for, and how you could improve them from this lesson:

  1. Use short timeouts. When your app requests something from a service over a network, limit the time you'll await a response, and make that timeout relatively short. My rule of thumb is roughly 3X the time it normally takes to complete the task, with a sensible lower floor, especially if you are going to automatically retry.
  2. Use automatic retries with exponential backoff. If you get an error on the first couple of attempts, automatically retry the request, increasing the delay between requests each time on a simple exponential scale. Slightly randomize the delay so client retries are more evenly spread out over time, rather than synchronized on a single failure event. Limit the total number of attempts.
  3. Quick errors are better than slow failures. If the response does not return after maybe two or three retries, it's not likely to happen at all. If you don't have a degraded mode to enter, go ahead and return an error to the user encouraging them to try again a bit later.
  4. Automatically enter a degraded mode. If your client is getting timeouts from its primary system, can you render a more limited user experience using an alternate system? For example, maybe my "Eat" app could give me a list of web links to the Google Docs that were used to print the daily menus displayed in the various cafes, sorted by location. Clicking one of the links could open that document in the web browser on my phone. Google Docs has no dependency on the APIs used to run "Eat", so the chance of both those systems failing at the same time is much lower. You don't need to run a query over the network to know what the link to the menu will be.
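
To make the four ideas above concrete, here is a minimal sketch in Python. It is not how "Eat" is built, and every name in it is an assumption made for illustration: `fetch` stands in for whatever network call your client makes, the default numbers (0.5 second floor, 0.2 second base delay, 25% jitter, three attempts) are placeholders to tune, and the `example.com` links are hypothetical stand-ins for static menu documents.

```python
import random
import time


class ServiceUnavailable(Exception):
    """Raised when every attempt failed and the caller should degrade."""


def request_timeout(typical_seconds, multiplier=3.0, floor_seconds=0.5):
    """Item 1: roughly 3X the typical latency, with a lower floor."""
    return max(typical_seconds * multiplier, floor_seconds)


def backoff_delays(max_attempts=3, base_seconds=0.2, jitter=0.25, rng=random.random):
    """Item 2: yield the delay before each retry, doubling every time."""
    for attempt in range(max_attempts - 1):
        delay = base_seconds * (2 ** attempt)
        # Randomize by +/- jitter so clients do not retry in lockstep.
        yield delay * (1 + jitter * (2 * rng() - 1))


def fetch_with_retries(fetch, timeout, max_attempts=3, sleep=time.sleep):
    """Items 2 and 3: retry a limited number of times, then fail quickly."""
    delays = backoff_delays(max_attempts)
    for _ in range(max_attempts):
        try:
            return fetch(timeout)
        except (TimeoutError, ConnectionError):
            delay = next(delays, None)
            if delay is None:  # no retries left: give up now
                raise ServiceUnavailable("gave up after %d attempts" % max_attempts)
            sleep(delay)
    raise ServiceUnavailable("no attempts were made")


# Hypothetical static links, shipped with the app so no network query is needed.
FALLBACK_MENU_LINKS = {
    "Cafe A": "https://example.com/menus/cafe-a",
    "Cafe B": "https://example.com/menus/cafe-b",
}


def load_menu(fetch, typical_seconds=0.3, sleep=time.sleep):
    """Item 4: return the full menu, or a degraded list of links sorted by location."""
    try:
        timeout = request_timeout(typical_seconds)
        return ("full", fetch_with_retries(fetch, timeout, sleep=sleep))
    except ServiceUnavailable:
        return ("degraded", sorted(FALLBACK_MENU_LINKS.items()))
```

A caller would pass its own network function, for example `load_menu(my_fetch)`, and draw either the full menu or the list of links depending on the first element of the result. The point of the design is that the degraded path touches nothing the failing service depends on.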

Chances are that we will take the time to make changes like this now. Will it take a significant amount of software engineering time or system resources to implement them? I doubt it. I'd bet this could be done in a few hours of work, and maybe a couple of days of testing. Suppose you took the time now to make this set of changes to your app. Might this help you out on that inevitable occasion when your systems don't work as planned? Try it and see.

Follow Adrian Otto on LinkedIn and Twitter.
