Thoughts on massive IT meltdown

CrowdStrike's recent flawed update led to a global IT meltdown, showing the world that an IT outage has no boundaries and can bring parts of the economy to a standstill. The outage affected airlines, health services, banks, the London Stock Exchange, 911 services in the US, and numerous businesses worldwide. It is considered one of the most significant IT outages in history, because so many companies run Microsoft Windows on both desktops and servers, and the flawed update crashed Windows machines running CrowdStrike's software.

I am immensely grateful to all the IT and cybersecurity teams who worked over the weekend to restore IT operations so that businesses around the world could get back to normal. To everyone around the world who was affected by this outage, I am truly sorry.

Let us continue to help the community restore IT operations quickly. Here are a few links to help individuals rapidly return systems to operation.

Restore IT operations guidelines

What happened?

There are plenty of learning opportunities here for everyone. Every failure teaches us something we can learn from, improve on, and use to move ahead. By embracing failure as a learning opportunity rather than a setback, individuals and organizations can gain valuable insights and improve their processes.

Do you think massive IT outages that cripple the world economy will become a thing of the past? The answer is no. IT outages of this scale will continue to happen; what matters more is how each company’s IT and cybersecurity organizations collaborate to evolve their IT resiliency processes. Consider ransomware: the recent outage was not a ransomware attack, but if a playbook for restoring essential business systems is readily available, you can draw on it in either situation.

Security vs. Business Value: When an IT outage occurs, which matters more, maintaining the security posture or restoring IT functions? Of course, the answer is restoring business operations. Remember, the company exists because of its revenue, and IT and cybersecurity are there to enable the business to run operations smoothly and securely.

Identify critical business systems and applications

Remember, when IT outages occur, you gather everybody into a Priority 1 incident bridge to understand the impact of the outage. The first question asked by C-level executives is, “What are the critical business systems that are directly impacted, due to which the business is losing revenue every second?” During such critical moments, the importance of a well-prepared organization becomes evident. If we have diligently matured our processes, tools, and guidelines, the recovery operations will be a streamlined process.

  • Do you have a playbook that identifies your critical applications?
  • Is information about the hardware and software components used in your organization documented in a standardized database such as a configuration management database (CMDB)?
  • Have you assigned a priority to each of these applications, and do you know the order in which the critical ones must be restored? A retail organization, for example, will consider point of sale (PoS) an essential system.
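
The questions above can be made concrete with a small sketch. It assumes a hypothetical CMDB extract in which each application has a restore priority (lower means more urgent) and a list of dependencies; the application names, priorities, and the `restore_order` function are illustrative assumptions, not taken from any real CMDB product. It derives a restore order that never restores an application before the systems it depends on.

```python
import heapq

# Hypothetical CMDB extract: application -> (restore priority, dependencies).
# A lower priority number means "restore sooner". All names are illustrative.
CMDB = {
    "domain-controller": (1, []),
    "database": (1, ["domain-controller"]),
    "point-of-sale": (1, ["domain-controller", "database"]),
    "email": (2, ["domain-controller"]),
    "intranet-wiki": (3, ["database"]),
}


def restore_order(cmdb):
    """Return applications in an order that respects dependencies, then priority."""
    # Count unmet dependencies per app and index each app's dependents.
    pending = {app: len(deps) for app, (_, deps) in cmdb.items()}
    dependents = {app: [] for app in cmdb}
    for app, (_, deps) in cmdb.items():
        for dep in deps:
            dependents[dep].append(app)

    # Apps with no unmet dependencies are ready to restore; the heap yields
    # the most urgent one first (ties are broken alphabetically).
    ready = [(cmdb[app][0], app) for app, count in pending.items() if count == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        _, app = heapq.heappop(ready)
        order.append(app)
        for nxt in dependents[app]:
            pending[nxt] -= 1
            if pending[nxt] == 0:
                heapq.heappush(ready, (cmdb[nxt][0], nxt))

    # Anything left unrestored means the dependency data contains a cycle.
    if len(order) != len(cmdb):
        raise ValueError("circular dependency in CMDB data")
    return order


print(restore_order(CMDB))
```

With this sample data the identity system comes first, followed by the database and point of sale, and only then the lower-priority systems. A real playbook would pull this data from the organization's own CMDB rather than a hard-coded dictionary.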

Business Continuity, Resiliency and Recoverability

  • Do you understand the dependency on other tech stacks? The identity systems that provide authentication and authorization services are important. If your Windows domain controller, which provides the authentication services, is affected, having a playbook to restore operations is a critical step.
  • Do you understand the importance of having a recovery plan for systems that provide foundational services to the organization? It's not just a good practice; it's a necessity.
  • Can you segregate critical systems so that patches are never applied to the entire infrastructure stack in one go?
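
One way to picture the last question is a staged rollout, where hosts are split into rings and a patch reaches the next ring only after the previous one has stayed healthy. The sketch below is an assumption-laden illustration: the ring names, the ring sizes, and the `assign_ring` and `plan_rollout` helpers are hypothetical and do not describe how any particular vendor or patching tool works. Whether a given update type can be staged this way depends on the product.

```python
import hashlib

# Hypothetical rings as (name, cumulative share of non-critical hosts).
# The sizes are illustrative, not a recommendation.
RINGS = [("canary", 0.05), ("early", 0.25), ("broad", 1.0)]


def assign_ring(hostname, critical=False):
    """Deterministically map a host to a rollout ring."""
    # Assumption: critical systems are patched last, once earlier rings
    # have shown the patch is safe.
    if critical:
        return RINGS[-1][0]
    # Hash the hostname into [0, 1) so the assignment is stable across runs.
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    for name, upper in RINGS:
        if bucket < upper:
            return name
    return RINGS[-1][0]


def plan_rollout(hosts):
    """Group hosts (hostname -> critical flag) by ring."""
    plan = {name: [] for name, _ in RINGS}
    for host, critical in hosts.items():
        plan[assign_ring(host, critical)].append(host)
    return plan


if __name__ == "__main__":
    fleet = {"dc-01": True, "pos-01": False, "web-01": False, "hr-laptop-17": False}
    for ring, members in plan_rollout(fleet).items():
        print(ring, sorted(members))
```

Hashing the hostname keeps each machine in the same ring from one patch cycle to the next, which makes failures easier to compare over time.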

Communications

  • Do you have a communication plan to quickly inform everybody affected by the IT outage, for example by email or text messaging services?

Remember, gathering all the resources and recovering the processes in a large organization during such a massive IT outage is challenging. Predicting every situation that can go wrong is not possible either. What is possible is learning from each of these outages: once operations have stabilized, take the time to understand the gaps in each IT department, iterate on the processes and tools, and, of course, upskill every individual on the team. These are a few steps in the right direction. I wish everyone involved in recovering systems the best of luck.

"Plans are worthless, but planning is everything." - Dwight D. Eisenhower

#CrowdStrike #Cybersecurity #Informationsecurity #Infosec #IT #globaloutage

I would like to chime in on the fact that this showed that EVERY organization impacted does NOT implement proper patch management procedures. You should test ALL patches before deploying them to every endpoint. Why aren't these IT departments doing this? (Not just CrowdStrike. Every organization affected should answer this question.)

Christopher S.

Christian | Husband | Father | Problem Solver | Constant Learner

7 months ago

I may be misreading your post, but how were 70% of the world’s desktops impacted? Surely CrowdStrike wasn’t running on all of them.
