THE DAY THE EARTH STOPPED: THE CROWDSTRIKE OUTAGE EXPLAINED

On Friday, July 19, 2024, the world briefly stopped. Millions of computers around the world started failing and showing the infamous blue screen of death (BSOD), all due to a minor software update from a single cybersecurity company called CrowdStrike. Systems running critical operations in transportation, healthcare and financial services were disrupted. Tesla CEO Elon Musk deemed it the “biggest IT fail ever”. In the United States, around four thousand flights were cancelled and more than nine thousand were delayed. In the United Kingdom, scheduled surgeries were postponed, and many general practitioners and pharmacies were unable to work. Many banks around the world were unable to provide customer services. One significant economy that remained largely unaffected by the outage was China, as CrowdStrike is not widely used there. But how did CrowdStrike manage to paralyze a large chunk of the world?

Founded in 2011, CrowdStrike is an Austin, Texas-based cybersecurity company that has become a key player in the industry in just a few years. With a valuation of $83 billion, the company has more than 29,000 customers, of which more than 500 are in the Fortune 1000. The company offers a wide range of cybersecurity products that block hackers and malware from affecting computers. One of these products, Falcon, which helps protect systems from cyber-attacks, was the culprit behind the global outages. Falcon is an EDR (Endpoint Detection and Response) system designed to identify, block and prevent hacking attempts. On Windows, it runs as a kernel-mode driver, and kernel-mode code operates at the lowest, most privileged level of the operating system. This means Falcon has the highest level of permissions, giving it the ability to monitor operations in real time across the operating system. It also means that a fault in Falcon can bring down the whole machine rather than just a single application.

On Friday morning, Falcon received a flawed update that crashed any Windows computer that had it installed and pushed it into a boot loop. Essentially, the computer would start loading Windows normally, detect a fatal error in the system, display the blue screen of death, and restart. This would happen over and over again, rendering the affected computers useless. In an official blog post published that weekend, Microsoft estimated that around 8.5 million devices across numerous sectors were affected by the update.

About an hour after the update, CrowdStrike identified the issue and deployed a fix. However, days later, the recovery process has proven to be far more complicated and time-consuming for businesses. The problem is that a machine already stuck in a boot loop cannot be repaired with a software patch from CrowdStrike; the faulty code has to be removed manually, on site. This means that IT workers spent long shifts over the weekend fixing each machine individually. Although big firms are back online thanks to their resourceful IT staff and better resilience measures, small and medium-sized businesses are still struggling. In a Reddit thread, one IT worker mentioned working a 16-hour shift on Friday and fixing a total of 900 physical and virtual computers. Another said that their company had about 170,000 crashed computers and about 300 people working to fix them. Although CrowdStrike CEO George Kurtz gave assurances that the issue had been identified and that the affected computers would be fixed swiftly, some IT workers predict that the process may take weeks, if not months.

Throughout the weekend, thousands of flights around the world were delayed or cancelled. Some airports had to check passengers in manually, while in India some passengers received handwritten boarding passes. Similarly, the healthcare sector in the U.S. faced disruptions in call centers and patient portals, while Mass General Brigham in Boston limited its services to urgent cases only. In the UK, booking systems went offline, affecting doctors and pharmacies, and Sky News, a major British broadcaster, was taken off the air. In Times Square, New York City, the famous digital screens went blue. The financial sector experienced significant issues, with banks like JPMorgan Chase, Bank of America, and TD Bank facing login and payment processing problems. Many online payment systems faced severe interruptions, preventing businesses from processing transactions smoothly. For instance, iPay88, a major payment gateway provider in Malaysia, experienced service disruptions, making it impossible for many e-commerce platforms to receive payments during the outage.

While the U.S. and Europe were rushing to fix the outages, Chinese companies were largely unaffected by them. The main reason is that almost all local firms use domestic cybersecurity products from companies like Tencent and 360. Similarly, government sectors and state-owned enterprises use the China-made UOS (Unity Operating System). In recent years, technology has advanced significantly and reliance on digital solutions has grown, which has only increased the need for robust and resilient IT infrastructure. However, despite all efforts, no system is immune to global IT outages. When such disruptions occur, the effects can be far-reaching, impacting multiple industries and geographies simultaneously. The recent global IT outage serves as a stark reminder of our collective vulnerability.

In recent years, it has become evident that China is distancing itself technologically from the rest of the world by developing its own solutions like WeChat, AliPay, Baidu, Douyin, and others. This trend indicates a growing separation between China’s tech landscape and that of the rest of the world. However, this separation also provides China with a distinct advantage in exceptional situations. By creating proprietary technologies and platforms, China has achieved a level of independence from global systems. This strategic decoupling ensures that during global IT outages, Chinese systems are often unaffected or can recover more swiftly. For example, while international platforms faced difficulties, Chinese users were able to continue using domestic services for their daily activities, highlighting the robustness and resilience of their self-sufficient tech ecosystem.

This decoupling extends beyond just user-facing applications. China has developed its own cloud services, data centers, and cybersecurity measures that operate independently from Western technologies. This creates a dual layer of protection: firstly, by reducing reliance on potentially vulnerable global networks, and secondly, by fostering innovation and robustness within its own tech landscape. In essence, while the rising tech walls may pose challenges for international integration, they also provide a safeguard against global disruptions, allowing China to leverage its unique position during times of crisis.

Although estimating the total cost of damages will be difficult, preliminary estimates suggest that the outage will cost Fortune 500 companies alone more than $5.4 billion in direct losses. The healthcare and banking sectors were hit the hardest, with estimated losses of $1.94 billion and $1.15 billion respectively. Fortune 500 airlines such as American and United face a combined loss of $860 million. CrowdStrike itself has lost around 22% of its stock market value since the outage, wiping out $19.4 billion in market cap.
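To put these figures in proportion, the short Python calculation below works out what share of the estimated Fortune 500 loss each group represents. It uses only the numbers quoted above. Since the total is given as "more than $5.4 billion", the percentages are rough upper bounds, and the function name is only illustrative.

```python
# Figures in billions of US dollars, taken from the estimates quoted above.
FORTUNE_500_DIRECT_LOSS = 5.4   # stated as "more than", so shares are approximate
HEALTHCARE_LOSS = 1.94
BANKING_LOSS = 1.15
AIRLINES_LOSS = 0.86


def share_of_total(amount, total=FORTUNE_500_DIRECT_LOSS):
    """Return `amount` as a percentage of `total`."""
    return 100 * amount / total


if __name__ == "__main__":
    two_sectors = HEALTHCARE_LOSS + BANKING_LOSS
    print(f"Healthcare + banking: {share_of_total(two_sectors):.0f}% of the estimate")
    print(f"Airlines:             {share_of_total(AIRLINES_LOSS):.0f}% of the estimate")
```

On these estimates, healthcare and banking together account for a little over half of the direct losses.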

One question that comes to mind is, will CrowdStrike pay for at least some of the damages it has caused? As of now, there is no concrete information indicating that CrowdStrike will be legally required to pay damages for the July 19, 2024, outage. While CrowdStrike has acknowledged the issue and issued public apologies, the question of financial liability remains complex. Some businesses affected by the outage, especially those facing significant operational and financial losses, might pursue compensation through legal means. However, the specifics of such liabilities would depend on the contractual agreements between CrowdStrike and its clients, as well as any relevant insurance policies.

But was this outage preventable? Many experts believe it was. CrowdStrike could have followed the four levels of software testing: unit testing, integration testing, system testing, and acceptance testing. Unit testing checks individual code components for bugs, while integration testing checks that combined components work seamlessly together. System testing evaluates the entire integrated software for performance, security, and overall functionality. Finally, acceptance testing involves end users verifying that the software meets business requirements and performs as expected in real-world scenarios. Thoroughly implementing these testing stages could have identified potential issues early, potentially preventing the outage. For example, CrowdStrike could have conducted integration testing before releasing the Falcon update globally. It could simply have tested the software update on several different computers running different versions of Microsoft Windows, then monitored the results to see how each version behaved after the update.
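The idea of testing an update on several Windows versions before a global release can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of CrowdStrike's actual pipeline: the list of Windows versions, the function names, and the simulated boot check are all hypothetical, and a real harness would install the update on test machines or virtual machines and reboot them.

```python
from dataclasses import dataclass

# Hypothetical test matrix: these versions are illustrative only.
WINDOWS_VERSIONS = [
    "Windows 10 22H2",
    "Windows 11 23H2",
    "Windows Server 2019",
    "Windows Server 2022",
]


@dataclass
class BootResult:
    os_version: str
    booted: bool


def boots_after_update(os_version: str, update_is_valid: bool) -> bool:
    """Stand-in for installing the update on a test machine and rebooting it.

    Assumption for illustration: a flawed update fails on every version.
    """
    return update_is_valid


def run_matrix(update_is_valid: bool) -> list[BootResult]:
    """Try the update on every version in the matrix and record the outcome."""
    return [
        BootResult(version, boots_after_update(version, update_is_valid))
        for version in WINDOWS_VERSIONS
    ]


def safe_to_release(results: list[BootResult]) -> bool:
    """Allow a global release only if every test machine booted."""
    return all(result.booted for result in results)


if __name__ == "__main__":
    for valid in (True, False):
        verdict = "release" if safe_to_release(run_matrix(valid)) else "block release"
        label = "valid update" if valid else "flawed update"
        print(f"{label} -> {verdict}")
```

The point of the gate is that a single failed boot anywhere in the matrix stops the release, so a crash is caught on a handful of test machines rather than on millions of customer devices.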

The CrowdStrike outage serves as a wake-up call for the entire cybersecurity industry, highlighting the critical need for resilience, redundancy, and collaborative defense mechanisms. Even the most robust systems require layers of redundancy and resilience to withstand unexpected failures. This global outage underscores the importance of proactive monitoring and rapid response capabilities in mitigating the impact of outages. As services return to normal, the focus should shift to preventing future incidents, with CrowdStrike pledging a thorough review of their systems and processes to enhance resilience. This event is also a stark reminder for businesses to have robust contingency plans and to diversify their cybersecurity strategies. While the disruption was significant, it offers an opportunity to learn and strengthen system defenses. As we navigate the aftermath of this outage, we must remain vigilant and committed to safeguarding our digital world, continually reflecting on lessons learned to better prepare for future incidents.
