When the World Goes Dark: An Ex-CTO's Guide to Crisis Management

As an Ex-CTO with over 35 years of experience in live sports technology, I've seen my fair share of crises. But the global IT outage that occurred on July 19, 2024, was unprecedented in its scale and impact. As tech leaders, our response in these critical moments can make or break our organizations and can affect millions of lives. Here's my perspective on how to navigate such a monumental crisis.

The Situation

On July 19, 2024, a catastrophic global IT outage struck, affecting airlines, healthcare providers, government agencies, emergency services, and countless businesses worldwide. The cause: a faulty security update from a major endpoint security provider, resulting in widespread system failures and the infamous "Blue Screen of Death" on Windows systems.

The Live Sports Tech Perspective

In the world of live sports technology, we operate under a unique kind of pressure. When a technical issue arises, it can halt a worldwide event that millions of people are watching in real time. Imagine the Super Bowl, the World Cup final, or the Olympic Games opening ceremony suddenly going dark because of an IT failure. The stakes are incredibly high, and the spotlight is intense.

This environment has taught us that when a crisis hits, there's only one objective: solve the problem, no matter what. The show must go on, and it's our job to make sure it does. This mentality of complete dedication to problem-solving is crucial in any major IT crisis, regardless of the industry.

The CTO's Priority List

In a crisis of this magnitude, it's crucial to maintain laser focus on the most critical tasks. Here's what should be at the top of every CTO's priority list:

1. Fix the Bug - Now

The primary objective is crystal clear: fix the bug causing the outage. This is not the time for lengthy meetings, blame games, or hypothetical discussions about future prevention. Every second counts, and your entire technical team should be mobilized with a single mission - get systems back online.

  • Assemble your best engineers and developers
  • Establish a war room (physical or virtual) for continuous communication
  • Set up a clear chain of command for decision-making
  • Implement rapid testing and deployment protocols

2. Understand and Execute the Fix

Once you've identified the fix, it's crucial to understand it thoroughly and execute it as quickly and efficiently as possible. This involves:

  • Documenting the exact steps required to implement the fix
  • Creating a streamlined process for rolling out the fix across affected systems
  • Setting up a triage system to prioritize critical infrastructure and services
  • Establishing clear communication channels with IT teams across your organization and potentially with other affected organizations
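The triage step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `AffectedSystem` record, the 1-to-3 criticality scale, and the system names are all hypothetical, and a real inventory would come from your asset database rather than a hard-coded list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AffectedSystem:
    """Hypothetical record for one group of machines hit by the outage."""
    name: str
    criticality: int     # assumed scale: 1 = on-air or safety path, 3 = internal tooling
    users_impacted: int  # rough estimate, used only to break ties


def triage_order(systems):
    """Sort most urgent first: lowest criticality number, then most users impacted."""
    return sorted(systems, key=lambda s: (s.criticality, -s.users_impacted))


# Example inventory (made-up names and numbers).
systems = [
    AffectedSystem("office-laptops", 3, 400),
    AffectedSystem("playout-servers", 1, 2_000_000),
    AffectedSystem("ticketing-kiosks", 2, 15_000),
    AffectedSystem("scoring-feed", 1, 500_000),
]

queue = [s.name for s in triage_order(systems)]
print(queue)
```

The point of writing the ordering down, even this crudely, is that the war room argues about the criticality scale once instead of re-arguing priorities for every machine.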

3. Focus on Speed and Completeness

In a situation like this, perfection is the enemy of good. Your goal is to get systems operational as quickly as possible while ensuring the fix is complete enough to prevent immediate recurrence.

  • Implement the fix in phases if necessary, starting with the most critical systems
  • Set up monitoring systems to quickly identify any issues with the fix
  • Be prepared to roll back if unforeseen complications arise
  • Keep stakeholders informed of progress, but don't let status updates slow down the work
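As a rough sketch of the phased approach above, the loop below applies a fix one phase at a time, checks health after each phase, and rolls back the failing phase before stopping. The function names (`apply_fix`, `is_healthy`, `roll_back`) and host names are hypothetical placeholders for whatever tooling you actually use; this is an assumption-laden illustration, not a deployment system.

```python
def phased_rollout(phases, apply_fix, is_healthy, roll_back):
    """Apply the fix phase by phase, most critical phase first.

    phases: list of (phase_name, [hosts]) in priority order.
    If any host in a phase is unhealthy after the fix, that phase is
    rolled back and the rollout stops. Returns the names of the phases
    that completed successfully.
    """
    completed = []
    for name, hosts in phases:
        done = []
        for host in hosts:
            apply_fix(host)
            done.append(host)
        if not all(is_healthy(h) for h in done):
            # Undo this phase in reverse order, then stop for investigation.
            for h in reversed(done):
                roll_back(h)
            break
        completed.append(name)
    return completed


# Simulated run: "kiosk-2" is assumed to stay unhealthy after the fix.
fixed = set()
bad_hosts = {"kiosk-2"}
phases = [
    ("critical", ["playout-1", "playout-2"]),
    ("important", ["kiosk-1", "kiosk-2"]),
    ("rest", ["laptop-1"]),
]

completed = phased_rollout(
    phases,
    apply_fix=fixed.add,
    is_healthy=lambda h: h not in bad_hosts,
    roll_back=fixed.discard,
)
print(completed)
print(sorted(fixed))
```

In this simulated run the critical phase stays fixed, the second phase is rolled back, and the third is never attempted, which is the behavior you want when a fix misbehaves partway through.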

4. Coordinate with External Parties

In a global outage like this, where the issue impacts systems not under your direct control, coordinating with external parties becomes a crucial part of the solution. This is similar to how we in live sports tech often need to work with broadcasters, venue technicians, and equipment manufacturers during a crisis.

  • Establish direct lines of communication with the affected security provider and other key vendors
  • Share information and resources to expedite the fix across all affected systems
  • Collaborate on testing and implementation strategies
  • Ensure consistent communication to avoid conflicting instructions or duplicate efforts

What Not to Do

It's equally important to know what to avoid during crisis resolution:

  1. Don't Waste Time on Blame: There will be time for post-mortems and accountability later. Right now, your energy should be entirely focused on resolution.
  2. Avoid Speculation: Don't get drawn into discussions about how this happened or how to prevent future occurrences. These are important topics, but not while you're in the thick of crisis management.
  3. Don't Neglect Communication: While your focus should be on the fix, remember to keep key stakeholders informed. Designate a team member to handle external communications so your technical team can stay focused.
  4. Avoid Scope Creep: Stick to fixing the immediate issue. Don't let discussions about system upgrades or overhauls distract from the urgent task at hand.

The Road Ahead

Once systems are back online and stable, then we can turn our attention to understanding root causes, implementing preventative measures, and improving our crisis response protocols. But in the heat of the moment, our singular focus must be on resolution.

As technology leaders, we bear a tremendous responsibility. In times of crisis, our ability to focus, act decisively, and lead with clarity can make all the difference. In live sports tech, we live by the mantra "the show must go on." Apply this same level of urgency and dedication to any critical IT issue. Stay focused, stay calm, and let's get those systems back online.


This is episode No. 154 of my LinkedIn newsletter, A guy with a scarf.

Subscribe here: https://lnkd.in/ddmvMF-Q



Follow my Retention Zone show!

Buy my books on eSports and AI: https://www.amazon.it/dp/B0CFSHKRKR


A FACTORY63 initiative

Produced by Carlo De Marchis


Carlo De Marchis

Advisor. 35+ years in sports & media tech. "A guy with a scarf" Public speaker. C-suite, strategy, product, innovation, OTT, digital, B2B/D2C marketing, AI/ML.

4 months ago

Subscribe to the A guy with a scarf newsletter for more: https://www.dhirubhai.net/newsletters/a-guy-with-a-scarf-6998145822441775104/

Mark Alan Bartholomew

Applied physics.(JOIN ME) the work presented here is entirely new

4 months ago

In speaking with a friend from Microsoft, this does not bode well for Crowdstrike. Someone's going to lose their job. And although only one percent of their market was affected, that's still millions of users. Some users are still trying to recover. Some data may be lost. It depends. A pointer.... to some null space.... now wrapped up in a loop, within C++, .... could it have been a hack? Yes, it could have been someone paid to release the error. What is the biggest vulnerability experienced in light of this IT affair,.... ? The biggest problem we face may not be from the error and shutdown of 8 million systems... it may be the distraction it provided, to then hack other systems.... this is the biggest yet unrealized threat. Did this affect our militaries? NO, they use a different, GCC system. Could this error have been the result of Artificial Intelligence and coding therein? YES, it very well may have been. CAN WE TRUST ARTIFICIAL INTELLIGENT SYSTEMS....? NO. I THINK THIS is the resounding answer..... NO. And the worst may yet be still to come...... MARK applied physics

Mark Alan Bartholomew

Applied physics.(JOIN ME) the work presented here is entirely new

4 months ago

What a boon for the cybersecurity industry. Who's ever trusted Microsoft? I left their products thirteen years ago. Have not had a problem since. In your opinion,... is this the result of Artificial Intelligent systems creating code? Kindest, MARK applied physics

Carlo De Marchis

Advisor. 35+ years in sports & media tech. "A guy with a scarf" Public speaker. C-suite, strategy, product, innovation, OTT, digital, B2B/D2C marketing, AI/ML.

4 months ago

Sport sponsorship

Anthony Smith-Chaigneau

Experienced business development, marketing & creative executive

4 months ago

Common Sense
