Perspective on the Global Microsoft and CrowdStrike Outage from an OT Security Professional with Extensive IT Experience

By Michelle Balderson July 19th, 2024

Today, the world experienced what may be the largest computer outage in history. Many Microsoft services were hit by a cascading failure that escalated into a catastrophic outage of Azure-delivered services, impacting Office 365 and many other services. Initially, these services were dealing with a manageable issue that the deployed resilient systems would have resolved. However, a recent update to CrowdStrike's Falcon EDR software seemingly caused kernel-level crashes on Windows machines, leading to widespread blue screen of death (BSOD) errors. This rendered numerous machines inoperable until IT departments could intervene and restore functionality.

My primary concern is that many of these BSOD-affected machines are deployed in industrial applications across numerous industries. As an Operational Technology (OT) professional, I take this as a sign that many clients have received poor advice regarding deployment architectures and best practices for business-critical and industrial applications. These industrial machines should never be connected to the Internet. Updates should be applied offline, and only after extensive testing has thoroughly vetted them. Given the critical nature of these machines, engineering best practices must be followed.

It is imperative that Information Technology (IT) teams, Operational Technology teams, and Engineering work together to ensure that properly architected solutions are deployed, with a focus on the availability and resiliency of systems. This outage demonstrates that many systems deployed in OT follow IT best practices rather than OT or Engineering best practices, and so fail to protect those systems from erroneous updates of this nature.

The long-term impact of this outage on IT systems will depend heavily on the preparedness of the affected IT departments. The older the impacted OS, the worse the situation could potentially be. Operational Technology devices, on the other hand, could be impacted for much longer because OT teams likely do not have the same degree of preparedness as IT to deal with machines, embedded systems, and their respective operating systems. OT companies often work with industrial controls vendors that embed computers into their systems, and these companies do not have access to the embedded machine or its operating system. This is also true of handheld machines and POS terminals that embed Microsoft products into their solutions. Additionally, industrial applications are not designed to handle sudden stops (such as a BSOD), which could render the devices inoperable or "bricked," requiring a physical replacement. Abrupt stops can have cascading effects on downstream devices and systems, leading to prolonged recovery times.

Actions to Ensure Resilience in IT Systems and Best Practices

  • Test updates thoroughly before pushing them to machines in the IT environment.
  • Reduce complexity of systems.
  • Establish testing and response plans to ensure teams are prepared to respond promptly to incidents.
  • Reduce system exposure by segmenting systems based on criticality to the business.
  • Understand the business context of systems and apply stringent controls to critical operational systems.
  • Engage trusted professionals and partners that are qualified and certified in the architecture of IT systems.
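
As one illustration of testing updates and segmenting systems by criticality, the following is a minimal Python sketch of a ring-based (staged) rollout. It is not how any particular vendor deploys updates. The `Host` type, the numeric criticality scale, the `apply_update` callback, and the `max_failure_rate` threshold are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    criticality: int  # hypothetical scale: 1 = test lab, higher = more business-critical


def build_rings(hosts):
    """Group hosts into rollout rings, least critical first."""
    rings = {}
    for host in hosts:
        rings.setdefault(host.criticality, []).append(host)
    return [rings[level] for level in sorted(rings)]


def staged_rollout(hosts, apply_update, max_failure_rate=0.0):
    """Apply an update ring by ring and halt when a ring exceeds the failure budget.

    apply_update(host) is a caller-supplied (hypothetical) function that returns
    True when the host is healthy after the update.
    Returns (names_of_updated_hosts, halted).
    """
    updated = []
    for ring in build_rings(hosts):
        failures = 0
        for host in ring:
            if apply_update(host):
                updated.append(host.name)
            else:
                failures += 1
        if failures / len(ring) > max_failure_rate:
            # Stop here so that more critical rings never receive the bad update.
            return updated, True
    return updated, False


fleet = [
    Host("erp-1", 3),
    Host("lab-1", 1),
    Host("office-1", 2),
    Host("lab-2", 1),
]

# Simulated bad update: it only fails on the office ring.
result = staged_rollout(fleet, lambda host: host.name != "office-1")
```

In this simulated run the update reaches the two lab machines, fails on `office-1`, and the rollout halts before the most critical host (`erp-1`) is touched. A real process would also need soak time and health checks between rings.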

Actions to Ensure Resilience in OT Systems and Best Practices

  • Document the criticality of systems.
  • Follow engineering best practices, involving all stakeholders to make balanced decisions on system operations focusing on availability and resilience.
  • Avoid connecting critical systems to the Internet; use intermediate systems or DMZs to share data between business domains.
  • Segment IoT systems away from process automation and control, as IoT often requires Internet access for data collection.
  • Use data collection and aggregation to establish data lakes for shared business data analysis while keeping real-time data operations separate and unaffected by IT.
  • Develop OT-specific incident planning and response plans.
  • Thoroughly test all updates before deploying them to systems.
  • Design and engineer solutions specific to each domain, acknowledging that IT practices do not always work in OT environments.
  • Deploy software and hardware solutions that are designed and validated for use within OT and industrial control systems.
  • Engage trusted professionals and partners that are qualified and certified in the architecture of OT systems.
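
The segmentation points in the list above can be sketched as a simple allow-list check on network flows between zones. This is an illustrative Python sketch only. The zone names (`control`, `dmz`, `business`, `iot`, `internet`) and the set of allowed conduits are assumptions made for this example, not a reference architecture.

```python
# Hypothetical zones and the only conduits permitted between them.
ALLOWED_CONDUITS = {
    ("control", "dmz"),        # process data is pushed out to the DMZ
    ("dmz", "business"),       # the DMZ shares aggregated data with IT
    ("business", "internet"),  # business systems may reach the Internet
    ("iot", "internet"),       # IoT often needs Internet access for data collection
}


def find_violations(flows):
    """Return every cross-zone flow (source_zone, dest_zone) that is not explicitly allowed."""
    return [
        flow
        for flow in flows
        if flow[0] != flow[1] and flow not in ALLOWED_CONDUITS
    ]


observed_flows = [
    ("control", "dmz"),
    ("control", "internet"),  # a critical system reaching the Internet directly
    ("iot", "control"),       # IoT talking to process control
    ("iot", "internet"),
    ("control", "control"),   # traffic inside one zone is out of scope here
]

violations = find_violations(observed_flows)
```

Here `violations` contains the direct `control` to `internet` flow and the `iot` to `control` flow, which are the two patterns the list above advises against. Denying by default and allowing only named conduits keeps a new, unreviewed connection from going unnoticed.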

Daniel Ehrenreich

Leading ICS-OT-IIOT Cyber Security Expert, Consultant, Workshops Lecturer, International Keynote Speaker

7 months ago

Very clear paper, Michelle Balderson. I appreciate that you direct this to OT experts. I have a small suggestion for enhancing the benefits of your post: direct it to vendors and experts who negligently and blindly use the term "IT-OT Convergence".

Michael G.

MSEE, Consulting Systems Engineer at Aruba Networks | Episodical Field CTO

7 months ago

There was no Microsoft outage.

Emily Crain

Emerging Tech | Transformative Leader | Stevie Award Winner

7 months ago

This is when being able to communicate effectively with large numbers of people in a very targeted way, on devices like employee mobile apps, to convey the continuity response plan and how to resume business becomes of utmost importance. Having the right tools, like Firstup, to intelligently deliver messages makes a huge difference in carrying out these plans successfully and bringing calm to your people.

Most of all, use the simplest operating system and paradigm possible to solve the task. Don't run a web application on a Windows cloud service if bare-metal programming against an Ethernet transceiver could get the same job done with less effort and more performance. Complexity leads to bugs, fragility, and security problems. A lot more developers need to learn hardware design and assembler, instead of inventing yet another way to bring the entire browser stack everywhere they go.
