What we can learn from a crash
Today Google officially returned to the office. We have 156,500 employees, so having so many of us show up again on the same day meant ramping up our on-site services. I work at one of our San Francisco Bay Area locations, where one nice perk is lunch served in a cafe on Google's campus. I use an internal Android app called "Eat" to browse the cafe menu and check ingredients and food allergens. It's terrific. But today, it was down when I checked it at lunchtime.
Now, before I say anything about why it didn't work, I want to be perfectly clear: I did not fully investigate all the reasons for the trouble, and our postmortem work is not yet complete. I've used the app for nearly five years, and I never had any problems with it until today. It's clearly not a mission-critical application for Google, and as a result it has not been allocated the same resources and rigor that an externally facing system gets. We don't expect it to be as reliable as our consumer services.
Our "Eat" application, like most apps (even many mission-critical ones), is missing a design feature known as graceful degradation. This is one of the key tools used to prevent cascading failures, and it is explained in a chapter of our SRE book that you can read online for free. It was very clear that this capability is missing because when I opened the app, it could not even draw its initial screen. There was no error message, just a cute green spinning animation in the middle of an 80% white screen indicating that something was loading. Nothing ever loaded.
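To make the idea concrete, here is a minimal sketch of what graceful degradation could look like on the client side. It is not how "Eat" is built; `load_menu`, `MenuView`, the `fetch` callable, and the `cache` dictionary are all hypothetical names chosen for illustration. The only assumption is that the network call raises an exception when it fails or times out, so the app can always draw something instead of spinning forever.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class MenuView:
    """What the screen should render. All names here are hypothetical."""
    items: list
    notice: Optional[str] = None  # banner text shown to the user, if any


def load_menu(fetch: Callable[[float], list],
              cache: dict,
              timeout_s: float = 2.0) -> MenuView:
    """Return something renderable no matter what the backend does."""
    try:
        # Assumption: fetch raises if the backend fails or times out.
        items = fetch(timeout_s)
        cache["menu"] = items
        cache["saved_at"] = time.time()
        return MenuView(items)
    except Exception:
        if "menu" in cache:
            # Degraded mode: stale data plus an honest notice.
            return MenuView(
                cache["menu"],
                notice="Showing the last saved menu; live data is unavailable.",
            )
        # Last resort: an explicit message instead of an endless spinner.
        return MenuView(
            [],
            notice="The menu is unavailable right now. "
                   "Paper menus are posted in the cafe.",
        )
```

The ordering is the point: live data first, stale data second, a plain explanation last. Each step is less useful than the one before it, but every one of them is better than a spinner that never resolves.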
That's no big deal. There is still food served in the cafe, and there are paper menus posted listing the allergens in the food. I had a salad and sandwich for lunch today, and it was delicious. While I was enjoying it, I was reminded of how many times I've seen problems like this in software systems. Graceful degradation is actually very rarely employed. Why? Because software engineering teams usually don't view it as a priority until they have some kind of technical disaster. You'll see these features in top services that billions of people use. In your average mobile application? Fat chance.
All large, complex systems experience failures. Yes, all of them. Even most simple systems will fail at some point, just like my trusty "Eat" app. Simply expecting them to work 100% of the time is unreasonable. Things happen. Is adding graceful degradation a big deal? Not usually, but you need to at least spend some time thinking about failure and preparing to deal with it.
Let's take the example of an overloaded system. Suppose "Eat" went down because of the larger number of people who showed up to our offices today (among other exacerbating factors). What might have helped? Follow along, and in your mind think about those apps that you care about and are responsible for, and how you could improve them from this lesson:
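As one illustration of the kind of change that can help an overloaded service, consider load shedding: when the server is already handling as many requests as it can, it answers the extra ones immediately with a cheap, degraded response instead of letting them queue up and time out. The sketch below is an assumption about how this might be done, not a description of the "Eat" backend; `LoadShedder`, `handler`, and `degraded` are hypothetical names.

```python
import threading


class LoadShedder:
    """Caps concurrent work; excess requests get a cheap fallback."""

    def __init__(self, max_in_flight: int):
        # One semaphore slot per request we are willing to serve at once.
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, request, handler, degraded):
        # Non-blocking acquire: if no slot is free, do not wait in line.
        if not self._slots.acquire(blocking=False):
            return degraded(request)
        try:
            return handler(request)
        finally:
            self._slots.release()
```

Here `degraded` might return a statically cached copy of today's menu, or a short "busy, try again shortly" message that the app can display. Either way the client gets a fast answer it can render, and the server keeps enough headroom to finish the requests it has already accepted.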
Chances are we will take the time to make changes like this now. Will it take a significant amount of software engineering time or system resources to implement them? I doubt it. I'd bet this could be done in a few hours of work, plus maybe a couple of days of testing. Suppose you took the time now to make this set of changes to your app. Might this help you out on that inevitable occasion when your systems don't work as planned? Try it and see.