System Down, Team Up: Building Confidence and Capability Through Crisis Communications
Everything's a journey. And sometimes it's hard. West Point Inn, August 2024

It was a lovely summer’s morning back in July 2023. My morning coffee was rudely interrupted when one of our systems failed in a way that left us only one path forward: an emergency upgrade that would take my product out of service during normal business hours.

“We have an Outage,” I thought. “Here’s our opportunity to get better.” More on that later.

4 Key Questions

While much has been written about the CrowdStrike Outage, both during and after the event, let's focus on four critical questions leaders should ask of their ventures in the wake of such an impactful Outage:

  1. Do we know what constitutes an Outage?
  2. Do we know who needs to be informed?
  3. What's our communication plan?
  4. How do we develop a culture of reliability?

The "we" here includes anyone in Product, Engineering, Operations, Customer Service, Customer Success, Sales, Marketing, or Executive roles: all of us.

Question four is the one that really matters, but it takes time to get there. Let’s go through the questions in order.

Do you know what an Outage is?

Systems do fail, but not every service interruption is an Outage. You need clear definitions:

  • Which systems are mission-critical?
  • What constitutes an Outage?

Consider:

  • Who is affected? The system is unavailable to one or more user groups.
  • Is it still working? The system is running, but returning wrong results.
  • Is it too slow? The service’s performance has degraded beyond acceptable limits.

For each of these questions, you need to measure and monitor what we call “heartbeat indicators”: key indicators of the system’s health and availability.
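As a rough illustration, the three questions above can be turned into a simple check over heartbeat indicators. This is a minimal sketch in Python, not a definitive monitoring design: the indicator names (`reachable_groups`, `error_rate`, `p95_latency_ms`) and the thresholds are assumptions made up for the example, and you would need to define your own for each mission-critical system.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Heartbeat:
    """One snapshot of hypothetical heartbeat indicators for a system."""
    reachable_groups: Dict[str, bool]  # user group -> can they reach the system?
    error_rate: float                  # fraction of responses known to be wrong (0.0 to 1.0)
    p95_latency_ms: float              # 95th percentile response time


# Illustrative thresholds only; agree on real values per system.
MAX_ERROR_RATE = 0.01
MAX_P95_LATENCY_MS = 2000.0


def outage_reasons(hb: Heartbeat) -> List[str]:
    """Return the reasons this snapshot counts as an Outage (empty list = healthy)."""
    reasons = []
    unreachable = sorted(g for g, ok in hb.reachable_groups.items() if not ok)
    if unreachable:
        reasons.append("unavailable to: " + ", ".join(unreachable))
    if hb.error_rate > MAX_ERROR_RATE:
        reasons.append("returning wrong results")
    if hb.p95_latency_ms > MAX_P95_LATENCY_MS:
        reasons.append("too slow")
    return reasons


snapshot = Heartbeat(
    reachable_groups={"customers": True, "internal": False},
    error_rate=0.002,
    p95_latency_ms=3500.0,
)
print(outage_reasons(snapshot))  # ['unavailable to: internal', 'too slow']
```

Returning the list of reasons, rather than a single yes/no, keeps the answer to “who is affected, is it still working, is it too slow?” visible to whoever has to decide whether to declare an Outage.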

There’s no one complete set of indicators to monitor.

How do you know which indicators to measure? Too often, it’s the ones you wished you’d had the last time the system failed without notice.

It’s commonplace - and good practice - to add new measurements to the set of heartbeat indicators in the wake of an Outage. It’s like the warnings on the packaging of a coffee maker: a new one gets added every year or so, as new mishaps are reported.

Key takeaway: most Outages are novel - you’ve never seen them before. Experience helps, but you will be surprised from time to time.

So, first “do no harm”. Make sure you understand the root cause of the Outage. Instincts may cause people to look in the wrong place or, more troublingly, to not look beyond the first symptom they discover.

Think about - and discuss - all the options for repair & resolution.

Do you know who needs to know? Having a Communications Plan

An Outage can be stressful and hectic. You need to have a communications plan; people need to know how to access it; and people need to be trained in how to use it.

Here’s why it matters: In moments of stress, people will revert to the level of their training. Without a plan to follow, you risk the communications being unpredictable, intermittent or even, worst of all, wrong.

Your communications plan should consider:

  1. Content: What to tell stakeholders
  2. Cadence: How often to communicate
  3. Audience: Who needs to know
  4. Spokesperson: Ideally, one consistent voice

Tailor communications for different audiences (e.g., technical vs. executive). Coordinate all communications through one person to ensure consistency and avoid misinformation.
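One way to make such a plan concrete is to write it down as data, so it can be looked up rather than remembered during an incident. The sketch below is only an illustration in Python: the class names (`AudiencePlan`, `CommunicationsPlan`), the audiences, the cadences and the spokesperson label are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class AudiencePlan:
    audience: str       # who needs to know
    content: str        # what to tell them
    cadence: timedelta  # how often to update them


@dataclass
class CommunicationsPlan:
    spokesperson: str   # one consistent voice for all updates
    audiences: List[AudiencePlan]

    def next_updates(self, last_sent: datetime) -> List[str]:
        """List when each audience is due its next update, soonest first."""
        due = sorted((last_sent + a.cadence, a.audience) for a in self.audiences)
        return [f"{when:%H:%M} {audience} (from {self.spokesperson})" for when, audience in due]


# Hypothetical plan: audiences, cadences and wording are assumptions.
plan = CommunicationsPlan(
    spokesperson="incident comms lead",
    audiences=[
        AudiencePlan("executives", "business impact and expected resolution time", timedelta(minutes=60)),
        AudiencePlan("engineering", "technical status and next steps", timedelta(minutes=15)),
        AudiencePlan("customers", "what is affected and when to expect the next update", timedelta(minutes=30)),
    ],
)

for line in plan.next_updates(datetime(2024, 1, 1, 9, 0)):
    print(line)
# 09:15 engineering (from incident comms lead)
# 09:30 customers (from incident comms lead)
# 10:00 executives (from incident comms lead)
```

Keeping a single `spokesperson` field on the plan, rather than one per audience, mirrors the advice to coordinate all communications through one person.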

Here’s a Key Takeaway: Outages happen. Your stakeholders will judge you based on how you behave during your recovery. Having an incident communications plan helps you get it right.

How do you develop a Reliability Culture?

Back in my telecom network days, I was taught: “Five Nines (99.999% availability) isn’t about system design but is an outcome of our culture”.
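To put the number in perspective, an availability target translates directly into a downtime budget. The short Python calculation below is a back-of-the-envelope sketch that assumes a 365-day year and counts all downtime, planned or not, against the target.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Maximum downtime per year allowed by an availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_minutes_per_year(target):.2f} minutes/year")
# 99.9% -> 525.60 minutes/year
# 99.99% -> 52.56 minutes/year
# 99.999% -> 5.26 minutes/year
```

Five Nines leaves a little over five minutes a year, which is less time than most teams need just to notice and declare an incident.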

The gold standard here is Netflix, which created Chaos Monkey: a tool that randomly terminates instances in production in order to identify weak spots and build resilience and reliability.
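The idea behind deliberate failure injection can be shown with a toy experiment. To be clear, this Python sketch is not Netflix’s tool or its API: the replica names and the “up if any replica is healthy” rule are assumptions invented for illustration.

```python
import random
from typing import Dict


def service_is_up(replicas: Dict[str, bool]) -> bool:
    """The toy service stays up as long as at least one replica is healthy."""
    return any(replicas.values())


def chaos_round(replicas: Dict[str, bool], rng: random.Random) -> str:
    """Fail one healthy replica at random and return its name."""
    healthy = sorted(name for name, ok in replicas.items() if ok)
    victim = rng.choice(healthy)
    replicas[victim] = False
    return victim


rng = random.Random(42)  # seeded so the experiment is repeatable
replicas = {"replica-a": True, "replica-b": True, "replica-c": True}

victim = chaos_round(replicas, rng)
print(f"failed {victim}; service up: {service_is_up(replicas)}")
```

If the service does not survive losing a single replica in an experiment like this, you have found a weak spot on your own schedule instead of during an Outage.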

Here are two practical things to do, even if they’re not always easy:

  1. Encourage people to escalate a problem.
  2. Lead by example: demonstrate that problems are, actually, ok.

There are two mantras I’ve learned over my career:

  1. “You’re not building if you’re not breaking”. Outages are an inevitable, regrettable, and hopefully rare consequence of change. Change is good.
  2. “We break things by ourselves. We fix things together.”

Key Takeaway: an engineer “crying wolf” is a good thing. It's the sign of an improving reliability culture. Make it easy for people to raise the alert and have people who need to know weigh in on whether it’s truly an issue or not.

Trusting the Team: A Case Study in building confidence and capability

Returning to the July 2023 incident: the Outage ended up demonstrating the impact of trusting your team.

This was one of those moments when you take a deep breath, organize yourself and then do the following:

  1. Think about how you’re going to set the context for the incident response team.
  2. Remind yourself that the wheels have come off the cart before. The team can and will put them back on.
  3. Pull out the Incident Response playbook. This is the time to follow the process, not remember the process.

So, I asked the engineer who told me about the problem to declare a P0 (Outage Priority Zero) and do all the other things that needed to happen. This included starting a dedicated Slack channel and kicking off a Huddle (similar to a Google Meet).

We had a quick discussion about what we knew, what we planned to do and how long we thought it’d take to resolve. We assigned roles and responsibilities. My role was Stakeholder Communications.

And then I offered the following: “You got this. I’m here to support you. But I know you can fix this without me leading the charge.” This was the culture-building moment.

Thumbs up all round, and then I left the Huddle.

In leaving, I communicated to everyone involved that I trusted the team to get it done.

For sure, I checked back in. But it was the first time with that team that I’d left an in-progress incident.

After the Outage was resolved, we conducted a team retro. In the retro, one of the positives was that I stepped back from “leader” and let someone else be in charge.

Key takeaway: trust the team to get it done. Make sure they’ve got the tools and the mandate to do the job. More than that: make sure they know you’ve got their back.

Key Takeaways

  • Outages are a fact of life. Your stakeholders will judge you on how you behave during the recovery.
  • Most Outages are novel. You will be surprised by how some of them came to happen. And there’s always going to be one more.
  • An engineer “crying wolf” about a possible Outage is the sign of an improving reliability culture.
  • Trust your team to get it done.

Where to from here? Request for Comments

In my next two articles, I'll be offering thoughts on

  • How do we improve reliability and support ownership - the "Crying Wolf" conundrum?
  • How do we monitor performance & health in Analytics & Dashboarding systems?

I'd love to hear from you. You can email me ([email protected]) or set up time to talk, here



Thanks to Shiqi Zhao and Ameya Bhope for their reviews of early drafts.


More articles by Dave Holmes-Kinsella