BSOD due to a software bug?

The global impact of our reliance on technology was starkly demonstrated when an erroneous update caused widespread system failures, disrupting operations across various sectors worldwide.

While the recent issue is resolved, it's crucial to recognize that this is not an isolated incident. We must proactively strategize and prepare to prevent such outages in the future, as the potential for recurrence is a stark reality that we cannot afford to ignore.

The parties involved must investigate why this happened in order to find a practical solution. The cause may well be a bug in the software that was pushed, and some bugs will slip through even the best testing. The more pressing question is why the problem was not caught after the first few systems were updated. Instead, the update continued to roll out to systems worldwide without anyone noticing.

There also appears to be a gap in how updates from CrowdStrike are applied to Microsoft Windows systems. Ideally, there should be some level of validation before an update is applied, and an alert should have been raised once the first few updated systems failed, to avoid further damage. It appears that updates were pushed unhindered across both primary and DR sites.
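As a rough illustration of the "raise an alert after the first few systems" idea, here is a minimal Python sketch of a staged rollout with a health gate. It does not describe how CrowdStrike or Microsoft actually deploy updates; the function name `staged_rollout`, the `apply_update` and `is_healthy` callbacks, and the batch sizes are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def staged_rollout(
    hosts: List[str],
    apply_update: Callable[[str], None],   # hypothetical: pushes the update to one host
    is_healthy: Callable[[str], bool],     # hypothetical: post-update health check
    stage_sizes: Sequence[int] = (2, 10, 100),
    max_failure_rate: float = 0.0,
) -> Dict[str, object]:
    """Push an update in growing batches and halt when a batch looks unhealthy."""
    if not stage_sizes or any(size <= 0 for size in stage_sizes):
        raise ValueError("stage_sizes must contain positive integers")

    updated: List[str] = []
    failed: List[str] = []
    remaining = list(hosts)
    stage = 0

    while remaining:
        # Once the listed stages run out, keep reusing the last (largest) size.
        size = stage_sizes[min(stage, len(stage_sizes) - 1)]
        batch, remaining = remaining[:size], remaining[size:]

        for host in batch:
            apply_update(host)
        batch_failed = [host for host in batch if not is_healthy(host)]

        updated.extend(batch)
        failed.extend(batch_failed)

        # Health gate: stop before touching any further hosts.
        if len(batch_failed) / len(batch) > max_failure_rate:
            return {"status": "halted", "updated": updated,
                    "failed": failed, "untouched": remaining}
        stage += 1

    return {"status": "completed", "updated": updated,
            "failed": failed, "untouched": []}
```

With the default settings, a faulty update would reach only the first two hosts before the rollout halts, leaving the rest untouched. Putting primary and DR hosts in different stages would keep a bad update from reaching both at once.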

There is also a question about how change management is handled in data centre operations. CIOs everywhere must ensure that their data centres don't receive updates without the impact being fully understood. Security changes sometimes get the highest precedence, but they should not come at the cost of bringing down the whole data centre. There is also a case for running critical applications on multiple operating systems; at the very least, putting the primary and DR sites on different vendor technology would be a better idea.

This issue returns the focus to foundational topics, such as how a critical system is managed and how far vendors or third-party systems are relied on to run the business. If outages of this kind are unacceptable, companies need to genuinely invest in a DR site and ensure it works independently of the primary site. If a working DR site had existed, handwritten boarding passes would not have been circulating online.

Business sectors that rely on technology to operate effectively must plan their investment in technology, people, and processes critically. These investments are not just beneficial but necessary for the smooth functioning of their operations.

This outage has raised doubts among the general public, and addressing such fundamental issues will be essential to restoring confidence. It will be easier to pinpoint the corrective actions once we know why so many systems were updated unnoticed, impacting global airline and banking operations.

We must get the basics right to ensure such issues don't happen again.




Venkat Mangira

Certified SAP Success factors Lead Global HRIT& Digital Automation/EC,Time off,People Analytics/Integration/ONB2.0/UKG/People&Culture Transformation(US B1/B2 Visa Holder)

8 months ago

Yes, completely agreed and aligned. Business enablement teams will be adversely impacted by such outages. Rightly said: CIOs must have plans to mitigate this issue at the data centre level.


