Lessons Learned from Failure: When Slack (and the World) Cannot Use Slack
Bill Zajac
Solution Architect Leader @ Amazon Web Services (AWS) | Bridging Humanity and Technology in Cloud Technologies
In May 2020, Slack had a significant outage. It was felt across many social channels and businesses that relied heavily on the service for collaboration, notification, and automation. So what happened? How does a company with a top-notch engineering team and access to some of the most advanced technology fail? Slack has shared what happened on May 12, 2020 for the world to digest. What follows are lessons highlighted from the article posted to ZDNet by Liam Tung. Let’s revisit that day and see what we can learn from Slack’s process:
---
Article: Like everyone else during lockdown, Slack's incident response team turned to a Zoom video meeting, which was not organized on Slack, but via a company-wide email – the medium that Slack was supposed to kill in the workplace.
Lesson: There is no such thing as a perfect environment. At some point, there will be a corner case that requires a war room or triage process. It is imperative that all organizations think as proactively as possible to prevent issues, but still recognize that failures will happen. This is why observability is such a hot topic right now. The industry has recognized it is impossible to configure an alert for EVERYTHING that can go wrong. Observability is a state of confidence that there is enough visibility into ANYTHING inside the environment that it can be INVESTIGATED if needed.
---
Article: "In such unfortunate situations where we aren't able to rely on Slack, we have prescribed alternative methods of communication," says Ryan Katkov, a senior engineering manager at Slack.
Lesson: Slack is VERY prepared. They have backup plans for backup plans. This is an extremely mature organization; most enterprises would be astounded at how far their own processes and engineering practices are from Slack’s. If issues can happen to Slack – they are capable of happening ANYWHERE.
---
Article: "Following incident runbooks, we quickly moved the incident to Zoom and elevated the response to a Sev-1, our highest severity level."
Lesson: Let machines and automation do everything they can do. If there is a predetermined set of tasks or steps to complete based on a given scenario and its outputs, do not waste a person's time doing them! Computers are faster and less prone to variability.
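To make the point concrete, here is a minimal sketch of turning one runbook decision into code. The Alert fields, thresholds, and escalation steps are illustrative assumptions, not Slack's actual tooling:

```python
# Hypothetical sketch: encode a runbook decision as automation instead of a
# manual checklist. Names and thresholds are illustrative, not Slack's tooling.
from dataclasses import dataclass

@dataclass
class Alert:
    service: str
    error_rate: float      # fraction of requests failing
    customer_impact: bool  # set by synthetic checks

def classify_severity(alert: Alert) -> str:
    """Apply the same predetermined rules a runbook would ask a human to follow."""
    if alert.customer_impact or alert.error_rate > 0.05:
        return "SEV-1"
    if alert.error_rate > 0.01:
        return "SEV-2"
    return "SEV-3"

def escalate(alert: Alert) -> None:
    severity = classify_severity(alert)
    # In a real system this would page on-call and open the war-room bridge.
    print(f"{alert.service}: declaring {severity}, paging on-call, opening bridge")

escalate(Alert(service="webapp", error_rate=0.08, customer_impact=True))
```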
---
Article: Zoom appears to have adeptly handled Slack's crisis communications needs when Slack was down.
Lesson: Through certain lenses, Slack and Zoom are competitors. However, there are significant enough differences between the two to warrant an investment in each. Like Slack has done, it is critical to focus on the use case your organization needs to solve rather than on overlapping features and functionality.
---
Article: "Engineers from different teams joined on the Zoom call, scribes were recording the conversation on Zoom for later dissemination, opinions and solutions were offered. Graphs and signals were shared," writes Katkov.
Lesson: The war room is necessary, but inherently inefficient. Whenever two or more specialists get into a meeting to determine whose realm of responsibility caused the problem, the innocent parties' time results in zero progress. I would argue there is always ONE needle in the haystack, and everyone else involved in the search provides negative value to the larger organization.
---
Article: "It's a flurry of action and adrenaline that the Incident Commander must guide and keep the right people focused so that the most appropriate action can be taken, balancing risk vs restoring service. Multiple workstreams of investigation were established and assigned with the goal of bringing remediation in the most efficient manner."
Lesson: The war room is a high-pressure situation where decisions and priorities are managed by a human. Although data is provided to the individual responsible for making a decision, a certain amount of judgment gets factored into the decision-making process. Oftentimes, previous offenders are the first to be considered guilty and have to prove their innocence, even if they are not showing any symptoms of causing the problem.
---
Article: But, as Laura Nolan, a Slack site reliability engineer explains, Slack's systems began tripping up at 8:30am when alarms were raised over its Vitess database tier for serving user data, which faced a spike in requests after Slack made a configuration change that triggered a "long-standing performance bug".
Lesson: There will always be anomalies in the system. In this scenario an alert was generated, which shows that systems tend to degrade before they start failing. The actual outage from an end-user perspective was not recorded until 4:45 PM PST – over eight hours after the first detection of an issue.
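As a rough illustration of catching degradation before failure, the sketch below compares a metric to its recent baseline. The metric, window, and threshold are assumptions, not Slack's alerting rules:

```python
# Minimal sketch of baseline comparison: flag degradation long before hard
# failure by comparing a metric to its recent history.
from statistics import mean, stdev

def is_degrading(history: list[float], current: float, sigmas: float = 3.0) -> bool:
    """Return True when the current value sits well outside recent behavior."""
    baseline = mean(history)
    spread = stdev(history) or 1e-9  # avoid division by zero on a flat history
    return (current - baseline) / spread > sigmas

# e.g. database requests per second sampled over the last hour
recent_rps = [510, 495, 505, 500, 498, 502, 507, 499]
print(is_degrading(recent_rps, current=900))  # True: investigate before users notice
```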
---
Article: Her account of the outage – 'A terrible, horrible, no-good, very bad day at Slack' – reveals that the extra load on Slack's database was contained by rolling back the configuration change. But the incident had a knock-on effect for its main web app tier, which carried "significantly higher numbers of instances" during the pandemic.
Lesson: Solving one incident created another. The decision to roll back to “the last healthy state” aggravated a bug that would eventually cause an end-user-impacting issue. Slack engineers did what was right at the time by solving a bottleneck at a lower tier that could eventually have impacted the end user. But doing so created a new path that would cause the overall system to fail.
---
Article: "We increased our instance count by 75% during the incident, ending with the highest number of webapp hosts that we've ever run to date. Everything seemed fine for the next eight hours – until we were alerted that we were serving more HTTP 503 errors than normal," explains Nolan.
Lesson: A modern system is never stable. There is always change being introduced, and hopefully most changes are positive. This could be an increase in user traffic due to more customers or a new feature being rolled out. Slack has the visibility required to immediately determine whether changes are, on balance, positive or negative. After the rollback, although more scale was required, health metrics indicated that the system was functioning as expected – from a service perspective.
---
Article: She details a series of issues that affected Slack's fleet of HAProxy software instances for its load balancer. Those issues ended up affecting its web app instances, which began to outnumber available slots within the HAProxy.
Lesson: Even systems with the simplest of tasks can be extremely complex. Each component needs to function independently, run efficiently, AND leverage horizontal and vertical dependencies in a manner that does not cause harm to the overall larger system.
---
Article: "However, over the course of the day, a problem developed. The program that synced the host list generated by consul template with the HAProxy server state had a bug. It always attempted to find a slot for new webapp instances before it freed slots taken up by old webapp instances that were no longer running."
Lesson: Slack is a well-architected, massive-scale environment with a significant amount of fault tolerance built in. The focus and validation from the team handling the issue was on one subset of the system, and anything anomalous outside that specific area will likely be dismissed as unrelated. All changes have impacts, but one change is not the cause of EVERY event that happens around the same time.
---
Article: That program began to fail and exit early because it was unable to find any empty slots, so the running HAProxy instances weren't getting their state updated.
Lesson: Here we have what could be identified as the root cause of the problem. The logic of the sync program said “find an open slot first, then free up stale slots”. Since many tasks run in parallel, competing scaling events could be shutting down one set of instances while other events are spinning up new instances of the same type.
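The sketch below is a deliberately simplified illustration of the ordering problem described in the article, not Slack's actual sync program: claiming a slot before freeing stale slots fails when the pool looks full, while freeing first succeeds.

```python
# Simplified illustration of the ordering bug: a fixed pool of slots tracks
# backend instances, and the sync step runs as instances come and go.
def sync_buggy(slots: list, live_instances: set, new_instance: str) -> None:
    """Buggy order: claim a slot for the new instance BEFORE freeing stale slots."""
    if None not in slots:
        raise RuntimeError("no empty slot")  # program exits early, state goes stale
    slots[slots.index(None)] = new_instance
    for i, inst in enumerate(slots):
        if inst is not None and inst not in live_instances and inst != new_instance:
            slots[i] = None

def sync_fixed(slots: list, live_instances: set, new_instance: str) -> None:
    """Safer order: free slots held by terminated instances first, then allocate."""
    for i, inst in enumerate(slots):
        if inst is not None and inst not in live_instances:
            slots[i] = None
    if None in slots:
        slots[slots.index(None)] = new_instance

# A full pool of old, already-terminated instances, plus one new instance:
pool = ["old-1", "old-2", "old-3"]
try:
    sync_buggy(pool.copy(), live_instances={"new-1"}, new_instance="new-1")
except RuntimeError as err:
    print("buggy sync:", err)       # fails, the load balancer state never updates
fixed_pool = pool.copy()
sync_fixed(fixed_pool, live_instances={"new-1"}, new_instance="new-1")
print("fixed sync:", fixed_pool)    # ['new-1', None, None]
```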
---
Article: "As the day passed and the webapp autoscaling group scaled up and down, the list of backends in the HAProxy state became more and more stale."
Lesson: Slack had a problem that went undetected for eight hours. During this period Slack was still able to provide uninterrupted service to its customer base. Had deeper inspection been done, SPECIFIC components inside the environment would have shown anomalous behavior when compared against previous states.
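One hedged way to surface that kind of staleness is a periodic drift check between the load balancer's view of its backends and the autoscaler's view of live instances. The data and names below are hypothetical stand-ins for real API calls:

```python
# Sketch of a drift check: compare the backends the load balancer believes
# exist against the instances the autoscaler says are live.
def backend_drift(lb_backends: set, live_instances: set) -> dict:
    stale = lb_backends - live_instances      # in LB state but already terminated
    missing = live_instances - lb_backends    # running but not receiving traffic
    return {"stale": stale, "missing": missing}

drift = backend_drift(
    lb_backends={"i-aaa", "i-bbb", "i-ccc"},
    live_instances={"i-ccc", "i-ddd"},
)
if drift["stale"] or drift["missing"]:
    print(f"backend drift detected: {drift}")  # page before the pool shrinks to zero
```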
---
Article: The outage happened after Slack scaled down its web app tier in line with the end of the business day in the US when traffic typically falls.
Lesson: This scenario showcases why the issue surfaced toward the end of the work day. The outage was actually triggered by proactively reducing the amount of resources the overall system was running with – something most modern systems do regularly to reduce cost.
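For readers less familiar with this pattern, the snippet below sketches a routine end-of-day scale-down using a scheduled autoscaling action in boto3. The group name, capacities, and schedule are placeholders, not Slack's configuration:

```python
# Sketch of a routine, scheduled scale-down for an autoscaling group.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="webapp-asg",          # placeholder group name
    ScheduledActionName="evening-scale-down",
    Recurrence="0 1 * * *",                     # 01:00 UTC, roughly end of US business day
    MinSize=20,
    MaxSize=100,
    DesiredCapacity=30,
)
```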
---
Article: "Autoscaling will preferentially terminate older instances, so this meant that there were no longer enough older webapp instances remaining in the HAProxy server state to serve demand."
Lesson: There are parallel initiatives inside every modern system. In this example, there is the service delivery that needs to be upheld as well as the mechanisms tasked with making the system as efficient as possible. Many other processes manage specific tasks inside the environment; other examples include deployment events and security scans.
---
Article: That problem was fixed, but Nolan and her team didn't understand why its monitoring system didn't flag the issue earlier. It seems the monitoring system's history of non-failure led to it being ignored until it failed.
Lesson: Great engineers, like those at Slack, respond to failures by looking for opportunities to improve. Notice how they addressed the logical gap in their current monitoring system.
---
Article: "The broken monitoring hadn't been noticed partly because this system 'just worked' for a long time, and didn't require any change. The wider HAProxy deployment that this is part of is also relatively static. With a low rate of change, fewer engineers were interacting with the monitoring and alerting infrastructure," explains Nolan.
Lesson: If it is not broken, do not fix it – but still validate that it is healthy.
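One common way to validate that monitoring itself is healthy is a dead man's switch: the monitor emits a heartbeat, and an independent check alerts when that heartbeat goes quiet. The sketch below uses hypothetical names and thresholds:

```python
# Minimal dead man's switch: the monitoring pipeline records a heartbeat each
# cycle, and a separate scheduler looks for monitors that have gone silent.
import time

heartbeats: dict = {}

def record_heartbeat(monitor_name: str) -> None:
    """Called by the monitoring system every time it completes a check cycle."""
    heartbeats[monitor_name] = time.time()

def silent_monitors(max_age_seconds: float = 300) -> list:
    """Run from an independent scheduler: find monitors that stopped reporting."""
    now = time.time()
    return [name for name, ts in heartbeats.items() if now - ts > max_age_seconds]

record_heartbeat("haproxy-state-sync")
print(silent_monitors())  # [] while healthy; the monitor's silence becomes an alert
```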
---
Article: The other reason the HAProxy stack was ignored by engineers was that Slack is moving to Envoy Proxy for its ingress load-balancing needs.
Lesson: All organizations should be investing in the future and constantly looking for improvement. There is a balance to strike between innovating and validating that the overall current state is still operating as expected.
---
Article: According to Nolan, Slack's new load-balancing infrastructure using Envoy with an xDS control plane for endpoint discovery isn't susceptible to the problems that caused its May outage.
Lesson: People learn from mistakes. In a world of change, there are bound to be failures. Once a failure happens, learn from it and make sure you have architected a solution that does not allow the failure to happen again.
---
I cannot emphasize enough how impressive and disruptive an organization like Slack is. I have tremendous respect for the maturity of their engineering process. Furthermore, their willingness to publicly share their story shows how confident this organization is and how durable its practices will be. I wish all of us worked for organizations that promote learning and improvement the way Slack has in the statements published about its May 12, 2020 event.
Notes: The content of this article came from the original post on ZDNet titled “Slack: Our 'terrible, horrible day' when outage forced us into Zoom meeting set up by email” by Liam Tung. Quotes from Slack employees were taken from “All Hands on Deck” by Ryan Katkov as well as “A Terrible, Horrible, No-Good, Very Bad Day at Slack” by Laura Nolan.