Better Lucky Than Not
[Image: AI-generated picture of what last week looked like.]


Last week, we were all reminded how fragile and interconnected the entire world really is. A series of events and failures brought down large segments of cloud and hybrid cloud infrastructure. This was not a “cyberattack,” or Dr. Evil’s first step in his plan for world domination; it was simply a series of configuration issues that caused a cascade of failures.

Here’s a quick recap of the events as they happened. At approximately 6 pm EST on 7/18/2024, a configuration error made by Microsoft caused a system-wide issue with Office 365 applications. As this configuration made its way through the various regional systems, users reported issues with the following services: Microsoft Defender, Intune, Teams, PowerBI, Fabric, OneNote, OneDrive for Business, SharePoint Online, Windows 365, Viva Engage, Microsoft Purview, and the Microsoft 365 admin center. In a possibly related incident, Azure services (also Microsoft) in the Central Region lost communications with the backend cluster management workflow due to a bad configuration change. We do not know if this was the same configuration change that caused the Office 365 issues or a separate one.

Then, around 4:30 am EST on 7/19/2024, Microsoft issued the following alert: “We have been made aware of an issue impacting Virtual Machines running Windows, running the CrowdStrike Falcon agent, which may encounter a bug check (“Blue Screen of Death”) and get stuck in a restarting state. We are aware of this issue and are investigating potential options Azure customers can take for mitigation.” This issue caused systems worldwide to crash, and the cascading failures caused major problems around the globe.

As the configuration error made its way around the world, government agencies, hospitals, airlines, and many other industries and companies went offline. IT professionals were left scrambling for information and a path out of the issue. Many of these companies had done everything correctly but fell victim to the outage anyway. Some major third-party providers were left without the ability to communicate internally or externally.

Within minutes of the first reports of the outage, the internet was saturated with memes from Linux and Mac users laughing at Microsoft users and joking about Microsoft’s history of issues related to DNS. Within the first hour, the extent of the damage was starting to appear. The issue became apparent to the public as ATM cards stopped working and many online activities (shopping, banking, messaging, etc.) became impossible due to downed systems. For people who were traveling during the outage, the widescale disarray took on a more serious tone, as over 2,800 U.S. flights were canceled and almost 10,000 others were delayed. Suddenly the humor changed to: “Holy $h!t! - what do we do now??!!” IT support desks were flooded with calls, and the vastly increased volume caused a worldwide slowdown in response to other issues, further exacerbating the problem.

As tech workers scrambled to put out fires, systems unrelated to the outage began having problems because of the lack of support capacity. Companies that had outsourced all of their support found themselves needing onsite staff, and companies that provide offsite support suddenly needed hands-on resources onsite.

So, what did we learn?

Disaster recovery (DR) plans without practice are just plans. Many companies have extremely well-thought-out DR plans. Some even have fully integrated business continuity plans that include a DR plan for tech operations. These companies have spent countless dollars on these plans, but do they practice them? I am not talking about the Saturday morning failover test or the “coffee and a pastry” tabletop exercise in the conference room… I am talking about actually trying to conduct business without major parts of your infrastructure, or with major partners offline. It is easy to forget that most companies rely on a considerable amount of infrastructure that is not under their control to conduct business every day. This most recent series of events started with a configuration error.


This isn’t the first time a tech issue has had a cascading effect that caused global problems. When Hurricane Sandy hit New Jersey, damage to the transatlantic fiber-optic network junctions caused massive slowdowns and outages across the world. The list of these events goes on and on, and as the world becomes more connected, each one has a wider impact.


When you review your DR plans, you should include plans for third-party systems and outages that you cannot control. A good start is a true critical systems list and an impact review of losing one or all of them. Then use this list to understand the true cost of losing them. When I speak of cost, I am not just talking about monetary cost; I am talking about the impact on your reputation and the human cost to your staff.
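One way to make that list actionable is to keep it as structured data so the impact review can be re-run and argued over. Here is a minimal sketch; the system names, owners, dollar figures, and workarounds are hypothetical placeholders, not data from any real inventory:

```python
# Hypothetical critical-systems inventory. The names, owners, dollar figures,
# and workarounds are illustrative placeholders, not data from a real company.
critical_systems = [
    {"name": "Payment processing", "owner": "third-party", "est_cost_per_hour": 50_000,
     "reputation_impact": "high", "workaround": "defer settlement / manual fallback"},
    {"name": "Email and chat", "owner": "third-party SaaS", "est_cost_per_hour": 5_000,
     "reputation_impact": "medium", "workaround": "phone tree and SMS broadcast"},
    {"name": "Warehouse management", "owner": "internal", "est_cost_per_hour": 20_000,
     "reputation_impact": "medium", "workaround": "paper pick lists"},
]

# Rank by estimated hourly cost so the review starts with the systems whose
# loss hurts most, and flag the ones that are not under your control.
for system in sorted(critical_systems, key=lambda s: s["est_cost_per_hour"], reverse=True):
    flag = "EXTERNAL" if "third-party" in system["owner"] else "internal"
    print(f'{system["name"]:<22} {flag:<8} ~${system["est_cost_per_hour"]:>7,}/hr  '
          f'fallback: {system["workaround"]}')
```

Even a toy table like this forces the uncomfortable conversation about how many of your critical systems are actually someone else’s infrastructure.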

Communication is usually the first casualty

What made this event so much more chaotic is that misinformation loves a vacuum. Rumors and downright wrong information spread everywhere. Conspiracy theories emerged involving everything from the presidential race to dark-money stock market moves. Meanwhile, the tech leaders with actual information were initially hard to come by. As the day went on, communication became a combination of CYA and reputation saving. We even heard that some vendors who were offline stayed completely silent until they had restored service. Other vendors refused to comment on the extent of the damage.


As IT professionals and business leaders, we need to swallow the hard pill and understand that, for the most part, this could have happened to any of us. We have to stand up and give accurate and complete information as soon as we know what that is. The goal for all of us should be to get a handle on the issue and find the solution. We can’t delight in others’ pain. We truly are all in this together; just because it wasn’t your turn today doesn’t mean it won’t be tomorrow. We are all targets.


Idle speculation in the press and online added to the confusion and delayed the response. A clear internal response to an event like this is critical when you are the victim or target, but it matters even more when you are not. If a company’s systems are not affected, it is important for IT staff to let users and partners know. Confusion and panic cause employees to make hasty decisions; this fact is well known and often exploited by cybercriminals. Within minutes of the Microsoft outage being made public, bad actors were offering to scan networks and help protect users from the “attack.”

Prepare and Plan for the worst

If a worldwide event outside of your control strikes (for example, several major internet exchanges go dark, or massive power grid failures occur), what is the plan? Remember, mission-critical systems that “cannot go offline” will eventually go offline. Proper planning will help reduce the impact of downtime and ensure recovery. If you plan as if it cannot happen, you are setting yourself up for an unpleasant surprise.


The question is not whether a system can go offline; the questions are…

· How much of an issue is the downtime?
· What will it impact?
· How can IT staff disseminate clear and honest information to users regarding an outage if one occurs?

In basic terms, downtime equals cost, and cost can be quantified. Some companies may decide the cost of downtime is less than the cost of mitigation. In such a scenario, the best course of action may simply be to shut down until the issue is resolved. The key is to know what you are going to do before an outage occurs.
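As a rough illustration of that quantification, the comparison can be as simple as the sketch below. Every figure is a hypothetical placeholder; substitute your own revenue, labor, and mitigation numbers:

```python
# Back-of-the-envelope downtime vs. mitigation comparison. Every figure is a
# hypothetical placeholder; substitute your own numbers.
revenue_per_hour = 40_000        # revenue lost while the system is down
staff_cost_per_hour = 3_000      # overtime and incident-response labor
expected_outage_hours = 8        # planning assumption for a worst-case event

downtime_cost = (revenue_per_hour + staff_cost_per_hour) * expected_outage_hours
annual_mitigation_cost = 250_000  # e.g., a warm standby in a second region

print(f"Estimated cost of one outage: ${downtime_cost:,}")
print(f"Annual cost of mitigation:    ${annual_mitigation_cost:,}")
if downtime_cost < annual_mitigation_cost:
    print("On these numbers, riding out the outage may be cheaper than mitigating it.")
else:
    print("On these numbers, one outage pays for the mitigation.")
```

The point is not the arithmetic; it is that the decision to mitigate or to accept the downtime gets made with numbers on the table before the outage, not during it.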

One last thing…

When it happens, take a deep breath and ask all the screaming people in the room to stop and do the same. “Operating in Chaos” is scary and stressful, and knee-jerk reactions often lead to bigger issues or longer response times. For example, when this issue first started, many IT professionals did the first thing they thought of and rebooted systems, which may have made things much worse in some cases. Hindsight is always 20/20, but a proper understanding of the scope of the outage might have limited its extent and impact.
