Perspective on the Global Microsoft and CrowdStrike Outage from an OT Security Professional with Extensive IT Experience
By Michelle Balderson, July 19th, 2024
Today, the world experienced what may be the largest computer outage in history. Many Microsoft services were hit by a cascading failure that escalated into a catastrophic outage of Azure-delivered services, impacting Office 365 and many other services. Initially, these services were dealing with a manageable issue that the resilient systems already deployed would have resolved. However, a recent update to CrowdStrike's Falcon EDR software seemingly caused kernel-level crashes, leading to widespread blue screen of death (BSOD) errors. This rendered numerous machines inoperable until IT departments could intervene and restore functionality.
My primary concern is that many of these BSOD-affected machines were deployed in industrial applications, so the outage reached numerous industries. As an Operational Technology (OT) professional, I take this as a sign that many clients have received poor advice regarding deployment architectures and best practices for business-critical and industrial applications. These industrial machines should never be connected to the internet. Updates should be applied offline, and only after extensive testing has thoroughly vetted them. Considering the critical nature of these machines, engineering best practices must be followed.
It is imperative that Information Technology (IT) teams, Operational Technology teams, and Engineering work together to ensure that properly architected solutions are deployed, with a focus on the availability and resiliency of systems. This outage demonstrates that many systems deployed in OT follow IT best practices rather than OT or Engineering best practices, and as a result are not protected from erroneous updates of this nature.
The long-term impact of this outage on IT systems will depend heavily on the preparedness of the affected IT departments. The older the impacted OS, the worse the situation could be. Operational Technology devices, on the other hand, could be impacted for much longer, because OT teams likely do not have the same degree of preparedness as IT for dealing with machines, embedded systems, and their respective operating systems. OT companies often work with industrial controls vendors that embed computers into their systems, and the companies themselves do not have access to the machine or its operating system. This is also true of handheld machines and POS terminals that embed Microsoft products into their solutions. Additionally, industrial applications are not designed to handle sudden stops (such as a BSOD), which could render the devices inoperable or "bricked," requiring physical replacement. Abrupt stops can have cascading effects on downstream devices and systems, leading to prolonged recovery times.
Actions to Ensure Resilience in IT Systems and Best Practices
Actions to Ensure Resilience in OT Systems and Best Practices
Leading ICS-OT-IIOT Cyber Security Expert, Consultant, Workshops Lecturer, International Keynote Speaker
7 months ago: Very clear paper, Michelle Balderson. I appreciate that you direct this to OT experts. I have a small suggestion for enhancing the benefits of your post: direct it to the vendors and experts who negligently and blindly use the term "IT-OT Convergence".
MSEE, Consulting Systems Engineer at Aruba Networks | Episodical Field CTO
7 months ago: There was no Microsoft outage.
Emerging Tech | Transformative Leader | Stevie Award Winner
7 months ago: This is when it becomes of utmost importance to communicate effectively with large numbers of people, in a very targeted way, on channels like employee mobile apps, to convey the continuity response plan and how to resume business. Having the right tools, like Firstup, to intelligently deliver messages makes a huge difference in carrying out these plans successfully and bringing calm to your people.
Programmer
7 months ago: Most of all, use the simplest operating system and paradigm possible to solve the task. Don't run a web application on a Windows cloud service if bare-metal programming against an Ethernet transceiver could get the same job done with less effort and more performance. Complexity leads to bugs, fragility, and security problems. A lot more developers need to learn hardware design and assembler, instead of inventing yet another way to bring the entire browser stack everywhere they go.