System Down, Team Up: Building Confidence and Capability Through Crisis Communications
Dave Holmes-Kinsella
Builder | Analytics & Data Leader: Strategy, Architecture, Build, Launch | From pre-A to post-IPO | 2 Exits | Former Synctera, Facebook
It was a lovely summer’s morning back in July 2023. My morning coffee was rudely interrupted when one of our systems failed in a way where the only path forward was for us to perform an emergency upgrade that would take my product out of service during normal business hours.
“We have an Outage,” I thought. “Here’s our opportunity to get better.” More on that later.
4 Key Questions
While much has been written about the CrowdStrike Outage, both during and after the event, let's focus on four critical questions leaders should ask of their ventures in the wake of such an impactful Outage:
The "we" here includes anyone in Product, Engineering, Operations, Customer Service, Customer Success, Sales, Marketing, or Executive roles: all of us.
Question four is the one that really matters, but it takes time to get there. Let’s go through the questions in order.
Do you know what an Outage is?
Systems do fail but not every service interruption is an Outage. You need clear definitions:
Consider:
For each of these questions, you need to measure and monitor what we call “heartbeat indicators” - the key measures of the system’s health and availability.
There’s no one complete set of indicators to monitor.
How do you know which indicators to measure? Too often, it’s the ones that you wished you had when the system failed the last time, without notice.
It’s commonplace - and good practice - to add new measurements to the set of heartbeat indicators in the wake of an Outage. It’s like the warnings on the packaging of a coffee maker: a new one gets added every year or so, as new mishaps are reported.
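As a rough illustration of what a set of heartbeat indicators might look like in practice, here is a minimal Python sketch. Everything in it is an assumption made for the example: the indicator names (`error_rate`, `p95_latency_ms`, `seconds_since_last_success`), the thresholds, and the helper names are hypothetical, not taken from any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class HeartbeatIndicator:
    """One health measure plus the rule that says whether a reading is healthy."""
    name: str
    is_healthy: Callable[[float], bool]


def unhealthy_indicators(
    indicators: List[HeartbeatIndicator],
    readings: Dict[str, float],
) -> List[str]:
    """Return the names of indicators that are missing or fail their check.

    A missing reading is treated as unhealthy: silence from a system
    is itself a signal worth investigating.
    """
    failing = []
    for indicator in indicators:
        value = readings.get(indicator.name)
        if value is None or not indicator.is_healthy(value):
            failing.append(indicator.name)
    return failing


# Hypothetical indicators and thresholds, for illustration only.
INDICATORS = [
    HeartbeatIndicator("error_rate", lambda v: v < 0.01),
    HeartbeatIndicator("p95_latency_ms", lambda v: v < 500),
    HeartbeatIndicator("seconds_since_last_success", lambda v: v < 120),
]

# Example readings: latency is too high and one indicator is not reporting.
readings = {"error_rate": 0.002, "p95_latency_ms": 850.0}
failing = unhealthy_indicators(INDICATORS, readings)
print(failing)
```

Keeping the indicators in a plain list makes the post-Outage habit easy to follow: adding the measurement you wished you had is a one-line change.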
Key takeaway: most Outages are novel - you’ve never seen them before. Experience helps, but you will be surprised from time to time.
So, first “do no harm”. Make sure you understand the root cause of the Outage. Instincts may cause people to look in the wrong place or, more troublingly, to not look beyond the first symptom they discover.
Think about - and discuss - all the options for repair & resolution.
Do you know who needs to know? Having a Communications Plan
An Outage can be stressful and hectic. You need to have a communications plan; people need to know how to access it; and people need to be trained in how to use it.
Here’s why it matters: In moments of stress, people will revert to the level of their training. Without a plan to follow, you risk the communications being unpredictable, intermittent or even, worst of all, wrong.
Your communications plan should consider
Tailor communications for different audiences (e.g., technical vs. executive). Coordinate all communications through one person to ensure consistency and avoid misinformation.
Here’s a Key Takeaway: Outages happen. Your stakeholders will judge you based on how you behave during your recovery. Having an incident communications plan helps you get it right.
How do you develop a Reliability Culture?
Back in my telecom network days, I was taught: “Five Nines (99.999% availability) isn’t about system design but is an outcome of our culture”.
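To put the Five Nines figure in concrete terms, the arithmetic can be sketched in a few lines of Python. This is a minimal illustration that assumes a 365-day year; the function name `downtime_budget_minutes` is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_budget_minutes(availability_pct: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime allowed over a period at a given availability target."""
    return period_minutes * (1 - availability_pct / 100)


# Compare a few common availability targets over one year.
for label, pct in [("Three Nines", 99.9), ("Four Nines", 99.99), ("Five Nines", 99.999)]:
    print(f"{label} ({pct}%): {downtime_budget_minutes(pct):.2f} minutes per year")
```

At 99.999% the budget works out to roughly 5.26 minutes of downtime per year, which is why a target like this depends on how a team behaves every day and not only on system design.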
The gold standard here is Netflix, which created Chaos Monkey - a tool that randomly causes production systems to fail in order to identify weak spots and build resilience and reliability. (More reading here)
Two practical things to do, even if they’re not always easy.
There are two mantras I’ve learned over my career:
Key Takeaway: an engineer “crying wolf” is a good thing. It's the sign of an improving reliability culture. Make it easy for people to raise the alert and have people who need to know weigh in on whether it’s truly an issue or not.
Trusting the Team: A Case Study in building confidence and capability
Returning to the July 2023 incident, the Outage led to a testimonial on the impact of trusting your team.
This was one of those moments when you take a deep breath, organize yourself and then do the following:
So, I asked the engineer who told me about the problem to declare a P0 (Outage Priority Zero) and do all the other things that needed to happen. This included starting a dedicated Slack channel and kicking off a Huddle (similar to a Google Meet).
We had a quick discussion about what we knew, what we planned to do and how long we thought it’d take to resolve. We assigned roles and responsibilities. My role was Stakeholder Communications.
And then I offered the following: “You got this. I’m here to support you. But I know you can fix this without me leading the charge.” This was the culture-building moment.
Thumbs up, all round; and then I left the Huddle.
In leaving, I communicated to everyone involved that I trusted the team to get it done.
For sure, I checked back in. But it was the first time with that team that I’d left an in-progress incident.
After the Outage was resolved, we conducted a team retro. In the retro, one of the positives was that I stepped back from “leader” and let someone else be in charge.
Key takeaway: trust the team to get it done. Make sure they’ve got the tools and the mandate to do the job. More than that: make sure they know you’ve got their back.
Key Takeaways
Where to from here? Request for Comments
In my next two articles, I'll be offering thoughts on
I'd love to hear from you. You can email me ([email protected]) or set up time to talk, here
Thanks to Shiqi Zhao and Ameya Bhope for their reviews of early drafts.