How Slack automates deploys

This is the latest issue of my newsletter. Each week I share research and perspectives on developer productivity. Subscribe here to get future issues.


This newsletter typically shares research on what impacts developer productivity. Today, I’m spotlighting a team at Slack that is putting some of these ideas into practice.

Slack runs on a large monolithic service where hundreds of developers ship hundreds of changes each week. Naturally, continuous deployment at this scale is a challenge. As the company grew, deploys became increasingly difficult and manual.

Slack’s Release Engineering team has transformed the deployment process from a manual one involving “Deploy Commanders” to a fully automated process using anomaly detection powered by Z-scores. In today’s issue we hear from Sean Mcilroy, of the company’s Release Engineering team, about how this transformation happened.

This story is a great example of a team that streamlined the developer experience through thoughtful automation and tools.

You can also listen to my full interview with Sean on the Engineering Enablement podcast.

About Slack’s Release Engineering team

Slack’s Release Engineering team has five members and owns various pieces of the infrastructure around building and deploying Slack’s monolithic service, which is called “The Webapp.” Most of Slack runs on the Webapp. ‘It’s a gigabytes-large mono-repo with hundreds of developers working on it every day, so getting this thing built and continually deployed has its own unique set of challenges. That’s why we have a team dedicated mostly to focusing on just it.’

How the deployment process worked previously

Deploys used to be managed by Deploy Commanders (DCs). There were hundreds of DCs, all trained volunteers. DCs were put on a rotation to work two-hour shifts in which they walked the Webapp through its deployment steps, watching dashboards and running manual tests along the way. Slack’s Release Engineering team managed the deployment tooling, the dashboards, and the DC schedule.

The challenge was that DCs didn’t feel confident making decisions. The Release team heard this feedback frequently. ‘Technically the process worked, but it was stressful. It’s difficult to monitor the deployment of a system this large. Also, DCs were on a rotation with hundreds of other developers. How do you get comfortable with a system that you may only interact with every few months? What’s normal? What do you do if something goes wrong? We had training and documentation, but it’s impossible to cover every edge case. Deploying something this large that affects this much of the company was an uncomfortable process for people.’

Changing the process

Initially, the Release Engineering team focused on giving Deploy Commanders better signals. Automating the process wasn’t yet on their radar. ‘The joke was that we were just going to have a single UI with a red button and a green button for go or no go, and that’s all deploy commanders would watch. That was the abstract goal, to have something that said “things are bad or things aren’t bad” so DCs don’t have to be experts on a million different things.’

The team sought to answer the question: How do you determine if things are good or bad? After some investigation, they decided to use anomaly detection with Z-scores. This approach is intuitive: during a deployment, DCs are monitoring dashboards for any noticeable differences. They’re looking for a spike or an anomaly. To make this easier, the Release Engineering team created ReleaseBot, a tool that detects anomalies and alerts DCs when something unusual happens.

Anomaly detection

Slack has two ways of detecting anomalies: Z-scores and thresholds.

Z-scores are a way to mathematically detect a spike in a graph. ‘A Z-score is the data point you’re worried about minus the mean of all the data points, divided by the standard deviation. What that equation gives you is the size of a spike in the graph.’ The larger the absolute value of the Z-score, the larger the outlier.
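As a minimal sketch of that formula in Python: the function name `z_score` and the sample data are illustrative, not taken from ReleaseBot, and the choice of population standard deviation is an assumption, since the article doesn’t say which variant Slack uses.

```python
from statistics import fmean, pstdev


def z_score(value, history):
    """Return how many standard deviations `value` lies from the mean of `history`."""
    mean = fmean(history)
    std = pstdev(history)  # population standard deviation (an assumption)
    if std == 0:
        # A perfectly flat history has no spread, so the score is undefined;
        # this sketch reports 0.0 rather than dividing by zero.
        return 0.0
    return (value - mean) / std


history = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data: mean 5, standard deviation 2
print(z_score(9, history))  # 2.0
```

Here the value 9 sits two standard deviations above the mean, so its Z-score is 2.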

Static thresholds: After implementing Z-scores in the ReleaseBot, the team needed to implement the ability to detect threshold breaches and then send a signal to their automation. If a threshold was breached, ReleaseBot would automatically stop deployments and send a notification to a Slack channel.

Release Engineering needed to test the thresholds: ‘We had to play with it a bit. We’d look at the data for past incidents and say “Okay, we see a big spike. What was the Z-score for that spike?” For us, we chose a threshold of 3 and/or -3. 3 generally represents a datapoint above the 99th percentile, so we know that’s a very large, bad spike.’
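The percentile claim can be checked with a few lines of Python, assuming the metric is roughly normally distributed. That assumption isn’t stated in the article, and real deploy metrics may have heavier tails.

```python
from statistics import NormalDist

# Fraction of a standard normal distribution that lies below Z = 3.
below = NormalDist().cdf(3)

# Fraction that lies outside the band from -3 to 3 (both tails combined).
outside = 2 * (1 - below)

print(f"{below:.4%} of values fall below Z = 3")
print(f"{outside:.4%} of values fall outside the -3 to 3 band")
```

Under that assumption, roughly 99.87% of points fall below a Z-score of 3, and only about 0.27% fall outside the band between -3 and 3.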

‘So if we have a graph that’s been hanging out between 1 and 3 for 3 hours, a jump to 5.5 would have a Z-score of 3.37. This is a threshold breach. Our metric only increased by 2.5 in absolute numerical terms, but that jump was a huge statistical outlier. It wasn’t a big jump, but it was definitely an unusual jump.’
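A breach check along these lines might look like the sketch below. The helper `is_breach` is hypothetical, the sample window is invented (so its score will not match the 3.37 quoted above), and the printed actions only stand in for whatever ReleaseBot actually does.

```python
from statistics import fmean, pstdev

Z_THRESHOLD = 3.0  # the cutoff quoted in the article, applied in both directions


def is_breach(latest, window, threshold=Z_THRESHOLD):
    """Return (breached, z) for `latest` measured against the earlier `window`."""
    std = pstdev(window)
    if std == 0:
        return False, 0.0  # flat window: treated as "no signal" in this sketch
    z = (latest - fmean(window)) / std
    return abs(z) >= threshold, z


# Made-up metric that has hovered between 1 and 3.
window = [1, 2, 3, 2, 1, 2, 3, 2, 1, 2, 3, 2]

for latest in (3, 5.5):
    breached, z = is_breach(latest, window)
    action = "stop deploys and notify the channel" if breached else "keep deploying"
    print(f"value={latest} z={z:.2f} -> {action}")
```

With this window, a reading of 3 stays well inside the band while 5.5 lands far outside it, even though the absolute change is small.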

Slack has found that Z-scores are nearly always worth looking at, even when they trigger an alarm and then quickly return to normal levels. However, these static thresholds were still too noisy to rely on for paging people. That’s where dynamic thresholds come in.

Dynamic thresholds allow you to keep using static thresholds while filtering out normal behavior. They calculate an average based on representative historical data.

To understand how they work, let’s use an example: a database team deploys a component every Wednesday at 3 p.m. During this deployment, database errors temporarily spike above your alert threshold, but your application manages them smoothly. Because of this, users don't notice any errors, so there's no need to halt deployments.

Dynamic thresholds are set using an average from relevant historical data. Deciding what counts as ‘relevant’ historical data has some nuance. ReleaseBot pulls data from multiple days at relevant times. For example, if it’s 6 p.m. on Wednesday, ReleaseBot will pull data from the past six hours (today), the same time window yesterday, and the same time window last Wednesday. Then it takes the average of those numbers. That’s how the Release Engineering team handled the noisiness of the static thresholds.
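The window selection described above can be sketched as follows. Everything named here is hypothetical: `relevant_windows`, `dynamic_threshold`, and the `fetch` callable are illustrations, not ReleaseBot’s API. Two further assumptions are labeled in the code: samples from the three windows are pooled before averaging, and the result never drops below the static threshold. The article only says an average is taken.

```python
from datetime import datetime, timedelta
from statistics import fmean

WINDOW = timedelta(hours=6)


def relevant_windows(now):
    """Return (start, end) pairs: the past six hours, the same window
    yesterday, and the same window one week ago."""
    offsets = (timedelta(0), timedelta(days=1), timedelta(days=7))
    return [(now - off - WINDOW, now - off) for off in offsets]


def dynamic_threshold(now, fetch, static_threshold):
    """Average the metric over the relevant windows.

    `fetch(start, end)` is a hypothetical callable returning metric samples.
    Assumption: samples are pooled, and the static threshold acts as a floor.
    """
    samples = []
    for start, end in relevant_windows(now):
        samples.extend(fetch(start, end))
    if not samples:
        return static_threshold
    return max(static_threshold, fmean(samples))


now = datetime(2024, 1, 10, 18, 0)  # a Wednesday at 6 p.m.


def fake_fetch(start, end):
    # Stand-in for a metrics query: pretend the metric sat at 40 in every window.
    return [40, 40, 40]


print(dynamic_threshold(now, fake_fetch, static_threshold=25))  # 40.0
print(dynamic_threshold(now, fake_fetch, static_threshold=50))  # 50
```

In the first call the historical average (40) is above the static threshold, so a routine spike to that level would no longer trigger an alert. In the second, history is quieter than the static threshold, so the static value still applies.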

This information is delivered in two ways. Most people interact with the system through Slack: when there is a series of events, ReleaseBot posts updates in the deploys channel. To avoid spamming the channel, ReleaseBot will thread messages if multiple events are happening at the same time. There is also a Release Web UI where users can check the status of the current deployment, but most users primarily interact with the Slack bot.

For more details on how exactly Slack’s Z-scores and dynamic thresholds work, go here. Sean also shares more context about the decisions they made in this interview.

Rolling out the ReleaseBot

Release Engineering developed ReleaseBot for a quarter and ran it alongside the Deploy Commanders for another quarter. Initially, only the Release Engineering team received alerts, but they soon added DCs to the channel. ‘We have fantastic engineers, but you can’t compete with a computer as far as timing goes. The bot would always be sending messages before the human came in to say something looked weird.’

Before long, the DCs began to rely increasingly on the bot. ‘They’d sort of let the deploy run and take a step back, and so we started thinking, “Why are we scheduling people for this anymore, instead of just letting the bot push the button?” The humans aren’t adding much at this point when they’re just fully trusting it to do the monitoring for them, so let’s use the bot.’ This is when Release Engineering realized ReleaseBot could be trusted to handle deployments on its own.

That was a big step. ‘Our team ended up taking on more of those Deploy Commander responsibilities, for example if the bot was acting up or if an incident did need to be called. But because we didn’t need somebody sitting there and watching it 24/7 anymore, we also didn’t need to sit there. The bot, at this point, will stop deployments, and then just page our team if it thinks it needs help. From there, it’s allowed us to focus on more deploy safety work.’

For example, they’ve started automatically rolling back when an issue is detected. ‘We’re encouraging people to hotfix less and rollback more.’ They’ve even made rollbacks easier: when an issue is detected, ReleaseBot sends a message saying it has stopped deploys and shows a rollback button that people can push. ‘Rollbacks are just safer and faster.’

Final thoughts

I love Slack’s story as an example of a team that made a process safer and easier for developers. By listening to feedback and refining their solution, they developed a tool that eliminated the hours developers previously spent on DC rotation.


That’s it for this week. Thanks for reading.

-Abi


David Meacock

Enabling great engineers and great teams to do great things

5 months ago

It's a great journey to talk about and one that really shows the power of understanding where you are and building up improvements to this much more streamlined world. The only thing I'd really challenge is the idea that "you probably have all the metrics you need today". I think quite a few companies I've worked with are still at the point of trying to get the visibility that they need, whether you're talking the product or the system view.

Jen C.

Software Developer, Tech Lead

5 months ago

There is a world where people aren't automating deploys, I keep reminding myself...
