CrowdStrike Down
David Furman
CEO @IMS-Network | CISO @AiDock | Chairman @Israeli Gaming Association
CrowdStrike is one of the world's largest and most respected IT security providers. Yet a single minor update turned the company into the culprit behind one of the worst IT disasters in history.
One software update crashed 8.5 million Windows computers, grounding planes and shutting down banks, hospitals, and government services across half the world.
Timeline of the CrowdStrike IT Disaster, July 19, 2024 (IDT): the disaster unfolds, the aviation crisis, CrowdStrike responds, the aftermath.
The Aftermath
Much of the world was back online within a day, but massive issues lingered, since each crashed device had to be fixed manually. The roughly 8.5 million affected machines amounted to less than 1% of all Windows devices, but they weren't just any devices: they sat at the heart of banks, airlines, hospitals, governments, and more. Even at 1%, the impact was massive. Fortune 500 companies alone are estimated to have suffered close to $5.4 billion in financial losses. This entire disaster came down to one small content update.
What Happened?
So, how did all of this happen? How did one update impact the IT world so significantly and so quickly?
It comes down to how CrowdStrike works. CrowdStrike's Falcon Sensor is not just antivirus software; think of it as the computer's immune system. It does a fantastic job of hunting for threats and can neutralize them automatically, but when it malfunctions, the consequences can be just as severe. To do this job, Falcon Sensor operates at the lowest possible level of the computer: not among ordinary user software, but in the kernel, the essential program that runs the operating system and talks directly to hardware.
This means Falcon Sensor can see everything at the lowest level of monitoring, which makes it a highly effective first line of defense. But that access comes with risk: there are fewer barriers between it and the hardware. If something goes wrong in user software, only that program crashes; if something goes wrong in the kernel, the whole machine crashes. This is precisely what happened.
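The difference is easy to see in a few lines of C. This is an illustrative user-space sketch, not CrowdStrike's code: run it as a normal program and the operating system contains the damage, killing only that process; execute the equivalent read inside a kernel-mode driver and there is no process boundary to hide behind, so Windows halts the whole machine.

#include <stdio.h>

int main(void) {
    /* A deliberately invalid address, echoing the one in the crash dump. */
    volatile int *bad = (volatile int *)0xffffd6030000006aULL;
    /* In user mode, this read raises an access violation and only this
     * process dies. The same read in a kernel-mode driver such as
     * csagent.sys has no process boundary to contain it, so Windows
     * stops the entire machine with a bugcheck (the blue screen). */
    printf("%d\n", *bad);
    return 0;
}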
CrowdStrike didn't release a major update; it pushed the smallest kind of change, a new piece of content to help the sensor identify new malicious software. But that content was faulty, and it triggered a logic error in the sensor's kernel driver: an attempt to read memory it should never have touched. When Windows detects an invalid memory access in kernel mode, it immediately halts the machine to prevent further damage. The blue screen of death was Windows protecting computers from the alternative.
The Cause (TL;DR)
CrowdStrike runs automated checks on these updates with a component called the Content Validator. Such checks can only go so far in catching bugs, but because CrowdStrike had rolled out thousands of similar updates in the past, nothing more was considered necessary. By CrowdStrike's own account, trust built from previous tests and successful deployments meant no additional testing, such as dynamic checks, was performed, so the bad update reached clients and caused the massive global IT outage. Had the update simply been loaded on a single Windows PC first, the problem would almost certainly have been caught.
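For a sense of how small the missing safeguard was, here is a hypothetical sketch in C. The names (template_type_t, inputs_match) and structures are my assumptions for illustration, not CrowdStrike's actual code.

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical data model, for illustration only. */
typedef struct {
    size_t defined_fields;    /* input fields the Template Type defines (21) */
} template_type_t;

typedef struct {
    size_t supplied_inputs;   /* input values the sensor code passes in (20) */
} interpreter_call_t;

/* A minimal dynamic check: before shipping content that matches on all
 * defined fields, confirm the sensor actually supplies that many inputs.
 * Any test that exercised this path on a real sensor would have failed
 * on the 20-versus-21 mismatch. */
bool inputs_match(const template_type_t *type,
                  const interpreter_call_t *call)
{
    return call->supplied_inputs >= type->defined_fields;
}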
The company dropped the ball and published the update without thorough testing. Given how much code and content companies like CrowdStrike deploy, some bad code will eventually slip through no matter how much testing is in place. And that brings us to why cybersecurity experts are furious at CrowdStrike: not because the company shipped some faulty code, but because of how it deployed that code, pushing it to every customer at once instead of rolling it out in stages.
Root Cause Analysis — Channel File 291
The CrowdStrike Falcon sensor leverages powerful on-sensor AI and machine learning models to protect customer systems by identifying and remediating the latest advanced threats. These models are continually updated and strengthened with insights from threat telemetry and intelligence gathered by CrowdStrike's security teams. The data begins as filtered and aggregated information on each sensor in a local graph store. The sensor correlates this context with live system activity to identify behaviors and indicators of attack (IOAs).
A vital part of this process is the Sensor Detection Engine, which combines built-in Sensor Content with Rapid Response Content delivered from the cloud. Rapid Response Content allows the sensor to gather telemetry, identify indicators of adversary behavior, and enhance detection capabilities without requiring code changes on the sensor.
Rapid Response Content is delivered through Channel Files and interpreted by the sensor’s Content Interpreter, which uses a regular-expression-based engine. Each Rapid Response Content channel file is associated with a specific Template Type built into the sensor. This Template Type provides the Content Interpreter with activity data and graph context to be matched against the Rapid Response Content.
With the release of sensor version 7.11 in February 2024, CrowdStrike introduced a new Template Type designed to detect novel attack techniques that abuse named pipes and other Windows interprocess communication (IPC) mechanisms. This new IPC Template Type was developed, tested, and integrated into the sensor following standard procedures. IPC Template Instances are delivered to sensors via a corresponding Channel File numbered 291.
However, while the new IPC Template Type defined 21 input parameter fields, the integration code that invoked the Content Interpreter with Channel File 291's Template Instances supplied only 20 input values. This mismatch went unnoticed through multiple layers of build validation and testing, including sensor release testing and stress testing of the Template Type with the initial IPC Template Instances.
On July 19, 2024, two additional IPC Template Instances were deployed. One of them introduced a non-wildcard matching criterion for the 21st input parameter, producing a new version of Channel File 291 that required the sensor to inspect the 21st input. The Content Validator evaluated the new Template Instances on the assumption that 21 inputs would be provided, so it did not flag the mismatch.
As a result, sensors that received the new version of Channel File 291 were exposed to a latent out-of-bounds read issue in the Content Interpreter. When the new IPC Template Instances were evaluated against an IPC notification from the operating system, the Content Interpreter attempted to access the 21st value, which did not exist, leading to a system crash.
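The following sketch reconstructs that failure mode in C. The structure and names are assumptions for illustration, not CrowdStrike's code, but the counts match the root cause analysis: 21 defined criteria, 20 supplied inputs.

#include <stdbool.h>
#include <string.h>

#define DEFINED_FIELDS  21   /* criteria the Template Type defines */
#define SUPPLIED_INPUTS 20   /* values the sensor integration supplies */

typedef struct {
    const char *criteria[DEFINED_FIELDS];  /* "*" acts as a wildcard */
} ipc_template_instance_t;

bool instance_matches(const ipc_template_instance_t *inst,
                      const char *inputs[SUPPLIED_INPUTS])
{
    for (int i = 0; i < DEFINED_FIELDS; i++) {
        if (strcmp(inst->criteria[i], "*") == 0)
            continue;              /* wildcard: the input is never read */
        /* When i == 20 this reads inputs[20], one element past the end
         * of the 20-entry input array: an out-of-bounds read. */
        if (strcmp(inst->criteria[i], inputs[i]) != 0)
            return false;
    }
    return true;
}

While every shipped instance used a wildcard for the 21st criterion, the out-of-bounds access was never executed, which is why release testing and stress testing passed. The July 19 instance set a concrete value for that criterion, and the first matching IPC notification from the operating system triggered the fatal read.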
Technical Details
The clearest technical evidence is the crash dump from an affected machine, analyzed below.
Crash Dump Analysis
The crash occurred because of an out-of-bounds memory read caused by the mismatch between the number of inputs provided and the number expected. Below is an excerpt from the crash dump analysis:
1: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                            Bugcheck Analysis                                *
*                                                                             *
*******************************************************************************
PAGE_FAULT_IN_NONPAGED_AREA (50)
Invalid system memory was referenced. This cannot be protected by try-except.
Typically the address is just plain bad or it is pointing at freed memory.
Arguments:
Arg1: ffffd6030000006a, memory referenced.
Arg2: 0000000000000000, X64: bit 0 set if the fault was due to a not-present PTE.
    bit 1 is set if the fault was due to a write, clear if a read.
    bit 3 is set if the processor decided the fault was due to a corrupted PTE.
    bit 4 is set if the fault was due to attempted execute of a no-execute PTE.
    - ARM64: bit 1 is set if the fault was due to a write, clear if a read.
    bit 3 is set if the fault was due to attempted execute of a no-execute PTE.
Arg3: fffff8020ebc14ed, If non-zero, the instruction address which referenced the bad memory address.
Arg4: 0000000000000002, (reserved)
The out-of-bounds read occurred when the code tried to access a memory location that did not exist because of the 21st input field mismatch. The faulting driver, csagent.sys, crashed the system as it read past the end of the allocated input array.
This line shows the code instruction that caused the error:
csagent+0xe14ed:
fffff802`0ebc14ed 458b08 mov r9d,dword ptr [r8] ds:ffffd603`0000006a=????????
The invalid pointer in r8 led to an attempt to read beyond the allocated memory, resulting in a system crash.
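To connect the dump back to source code, here is an illustrative C sketch (names assumed, not CrowdStrike's code) of the kind of expression that compiles to exactly such a load: a 32-bit read through a pointer one element past a 20-entry array.

#include <stdint.h>

typedef struct {
    uint32_t inputs[20];                   /* only 20 values were supplied */
} input_block_t;

uint32_t read_21st_input(const input_block_t *blk)
{
    const uint32_t *p = &blk->inputs[20];  /* one element past the end */
    return *p;   /* a dword load like: mov r9d, dword ptr [r8] */
}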