Navigating Through the Chaos: Lessons from a Recent Outage
I was at the airport trying to figure out the travel plans for my family after the CrowdStrike/Microsoft outage that disrupted airlines all across the globe. I remember feeling the same sense of frustration and anger that so many on LinkedIn have already written about. But once things settled down and travel uncertainties got ironed out, I also recall feeling a sense of empathy for the developer/engineer who pushed the infamous update out, hoping that they would not be too harsh on themselves.
As is often the case, something like this is rarely the fault of one engineer, but rather of gaps in process that span multiple teams: reviewers, QA, and infrastructure engineers, to name a few of those associated with a change like this. And of course there is the organizational process for pushing out such a change, where policies for gradual, risk-free promotion of changes are needed.
All this got me thinking about how we would have handled a situation like this at Aampe.
Here’s how we handled a recent outage at Aampe:
Limit the Damage: Once we were informed of the issue, the team’s very first step was to limit the ongoing damage. All agents that were running and personalizing experiences were paused until we had more information.
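To make that concrete, here is a minimal sketch of the kind of global pause switch that lets you halt all agents quickly. The flag store, file path, and function names are assumptions for illustration, not Aampe’s actual control plane.

```python
# A minimal sketch, assuming a shared flag that every agent checks before working.
# All names and paths here are hypothetical.
import json
import pathlib
import time

PAUSE_FLAG = pathlib.Path("/tmp/agents_paused.json")  # hypothetical flag location

def pause_all_agents(reason: str) -> None:
    """Set a global pause flag that every agent consults before doing work."""
    PAUSE_FLAG.write_text(json.dumps({"paused": True, "reason": reason, "ts": time.time()}))

def agents_paused() -> bool:
    """Return True if the incident pause flag is currently set."""
    return PAUSE_FLAG.exists() and json.loads(PAUSE_FLAG.read_text()).get("paused", False)

def agent_loop() -> None:
    """Skeleton of an agent work cycle that respects the pause flag."""
    while True:
        if agents_paused():
            time.sleep(30)  # back off until the incident flag is cleared
            continue
        # ... personalize experiences here ...
        time.sleep(5)
```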
Determine the Cause of the Issue: Next, the team traced the path the data took to identify at what point the issue was first observed. Once that was identified, we zoomed in on the moment in time when it was observed. With those two pieces of information, we could track all the changes that had been deployed to production in that window. We determined that a recent Python package version upgrade was responsible for the issue.
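For illustration, here is a hedged sketch of one way to narrow a regression down to a dependency bump: diff two pip-freeze-style snapshots taken before and after the deployment window. The file names are hypothetical, and this is not Aampe’s actual tooling.

```python
# A minimal sketch: diff two "package==version" snapshots to spot upgraded packages.
def load_pins(path: str) -> dict[str, str]:
    """Parse 'package==version' lines into a {package: version} mapping."""
    pins = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#") and "==" in line:
                name, version = line.split("==", 1)
                pins[name.lower()] = version
    return pins

before = load_pins("requirements.before.txt")  # snapshot before the issue appeared
after = load_pins("requirements.after.txt")    # snapshot from the deployment window

for pkg in sorted(set(before) | set(after)):
    if before.get(pkg) != after.get(pkg):
        print(f"{pkg}: {before.get(pkg, 'absent')} -> {after.get(pkg, 'absent')}")
```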
Develop the Fix and Test: The team had two options: roll back the upgrades or make a code change to account for the changed library behavior. Because the code change was simple, we decided to keep the upgrades and make the fix. We spent more time testing the fix than writing it, splitting the tests by OS, customer, and technology stack and running them all in parallel.
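As a rough illustration of that kind of test matrix, the sketch below parametrizes a smoke test across OS, customer, and stack with pytest. The dimension values and the smoke-test helper are hypothetical, and parallel execution would come from a runner such as pytest-xdist (`pytest -n auto`), which is an assumption about tooling rather than Aampe’s actual setup.

```python
# A minimal sketch, not Aampe's actual suite: parametrize one smoke test across
# OS, customer, and technology stack. All values below are hypothetical.
import itertools
import pytest

OSES = ["linux", "macos"]                 # hypothetical OS targets
CUSTOMERS = ["customer_a", "customer_b"]  # hypothetical customer configs
STACKS = ["bigquery", "snowflake"]        # hypothetical technology stacks

MATRIX = list(itertools.product(OSES, CUSTOMERS, STACKS))

def run_pipeline_smoke_test(os_name: str, customer: str, stack: str) -> bool:
    """Hypothetical stand-in for exercising the code path touched by the fix."""
    return True

@pytest.mark.parametrize("os_name,customer,stack", MATRIX)
def test_fix_across_matrix(os_name, customer, stack):
    # Each combination becomes its own test case, so a parallel runner
    # (e.g. pytest-xdist) can execute them concurrently.
    assert run_pipeline_smoke_test(os_name, customer, stack)
```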
Postmortem: The postmortem was held a few days after the incident to give the team time to recover and reflect on the outage. It is orchestrated by a different team member each time. We start by establishing a timeline of events as they happened; the first goal is to diagnose, and our Slack channels and threads come in handy here. Next, we work through the contributing factors and identify anything that made the issue harder to detect. Finally, we look into the process improvements that would have kept the issue from making it to production.
The key takeaway? At no point in the incident management process was energy invested in identifying who was to blame. It’s always about what happened, when it happened, how it can be fixed, and how it can be prevented from happening again. That lets us learn as a team and build confidence individually, fostering camaraderie and cohesion.