The CrowdStrike Windows Outage: A Detailed Case Study
Pinky Polowalia
Technical Lead @TBO.COM | Gold Medalist | Formerly at Cvent | Transforming Ideas into Impactful Solutions | Problem Solver | Open Source Contributor | Building Scalable Systems
Recently, the CrowdStrike outage has been making headlines everywhere. If you're curious about what happened and why it matters, you're in the right place. This post breaks down the incident, its implications, and the lessons we can all learn, even if you're not a tech expert. A quick preface: this is a best-effort account based on my own experience and on information collected from the community, written to keep things simple. Let’s dive in!
The Incident
On July 19, 2024, around 12 PM IST, many organisations encountered an unexpected issue: the Blue Screen of Death (BSOD), leading to widespread disruptions. An update from the cybersecurity company CrowdStrike caused this mess, impacting critical systems in various sectors, including hospitals, airlines, and emergency services. When flights are delayed, hospital systems are down, and emergency services are unavailable, it’s serious business: it affects people's lives.
Who's CrowdStrike?
CrowdStrike is a cybersecurity company that protects machines from cyber threats with a lightweight agent installed on Windows systems. It protects over 29,000 organizations, and its main service is EDR (Endpoint Detection and Response), which catches malware and ransomware. Because CrowdStrike’s software integrates deeply into Windows, the fallout from this issue was massive, affecting the entire Windows system.
A Bug is Just a Bug, Until It's Not
Here's a detailed breakdown of what happened and how to fix it.
First, let's explain why you saw the infamous Blue Screen of Death (BSOD) on screens around the world today.
BSOD - Blue Screen of Death
A BSOD signifies a critical error, often involving kernel-level operations, which have privileged access to system resources. Windows runs code in two main modes: user mode, where ordinary applications run with restricted access, and kernel mode, where the operating system core and drivers run with full access.
A crash in kernel mode usually causes a BSOD. EDR software like CrowdStrike’s needs kernel access to monitor system events and to stop malware before it can act on a system. CrowdStrike’s kernel driver is a required boot-start driver, so when it hit an error, Windows crashed with the BSOD you saw in the news headlines.
Ok, but what happened?
CrowdStrike pushed what they called a "Channel File Update" to all customer systems. This file (C-00000291-00000000-00000032.sys) bypasses customer-configured Sensor Update Policies and is a background update to the core components of all installed agents. These updates usually apply without issue and need no action from the user.
However, as the stack trace from a crash shows, it introduced a null pointer error once the kernel driver (CSagent.sys) tried to load this file. I won't go into pointers and C++ memory management here, but low-level languages let you do unsafe things that cause crashes if you don't write checks into your code. CrowdStrike is written in C++, where failing to check for NULL pointers can cause crashes. This error led to a memory access violation, forcing Windows to crash the entire system.
The stack trace from a crash that day shows the error "Access Violation", which indicates a problem accessing some memory. It also shows that the read address is 0x9c and that the failing instruction is a move (mov), which is Assembly for "copy data from here to there". Unfortunately, the memory location 0x9c is not accessible because it lies in an invalid region of memory. When kernel-mode code tries to access this address, Windows crashes.
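To make the 0x9c address concrete, here is a minimal C++ sketch. It is not CrowdStrike's code: the ChannelEntry struct, its fields, and the flags_address and read_flags helpers are hypothetical, chosen only so that one field sits at offset 0x9c. It illustrates how reading a field through a null pointer becomes a read of a small invalid address, and how a check before the dereference avoids it.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical record layout, invented for illustration only.
// 0x9c bytes of header put `flags` at offset 0x9c (156 decimal).
struct ChannelEntry {
    std::uint8_t header[0x9c];
    std::uint32_t flags;
};

// Address the CPU would read for `entry->flags`:
// the pointer value plus the field's offset inside the struct.
constexpr std::uintptr_t flags_address(std::uintptr_t base) {
    return base + offsetof(ChannelEntry, flags);
}

// Defensive read: report "no value" instead of dereferencing a null pointer.
std::optional<std::uint32_t> read_flags(const ChannelEntry* entry) {
    if (entry == nullptr) {
        return std::nullopt;  // caller decides how to handle the bad input
    }
    return entry->flags;
}
```

With a null pointer as the base, flags_address(0) evaluates to 0x9c, which is how a null pointer can surface as a read of a low address like the one in the crash. read_flags(nullptr) returns an empty result instead of faulting.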
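Resolution: CrowdStrike BSOD Bug
To recover your machine, boot into Safe Mode or the Windows Recovery Environment, then use one of the options below.
To delete the faulty channel file: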
del C:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys
To disable the CrowdStrike driver instead:
@echo off
setlocal
REM Define the driver file pattern
set "driver_pattern=C-00000291*.sys"
REM Define the target directory
set "target_dir=C:\Windows\System32\drivers\CrowdStrike"
REM Change to the target directory
cd /d "%target_dir%" || (
echo Failed to change directory to %target_dir%
goto :error
)
REM Find the driver file
for %%f in (%driver_pattern%) do (
set "driver_file=%%f"
goto :found
)
echo No driver file matching %driver_pattern% found.
goto :error
:found
REM The channel file is a data file, not a service, so its name cannot be passed to sc.
REM Target the sensor's kernel driver service instead (assumed here to be CSAgent, matching CSagent.sys).
set "driver_name=CSAgent"
REM Disable the driver
sc config %driver_name% start= disabled || (
echo Failed to disable the driver %driver_name%
goto :error
)
echo Successfully disabled the driver %driver_name%
REM Reboot the system
shutdown /r /t 0
goto :eof
:error
echo An error occurred. Exiting without reboot.
pause
endlocal
REM Return a non-zero exit code so callers can detect the failure
exit /b 1
Other resolution steps include:
1. Uninstall or Disable CrowdStrike
2. Update Drivers and System Files
3. Reinstall CrowdStrike
4. Test and Monitor
Preventive Measures for the Future
1. Regular Backups: Ensure regular backups of critical data to mitigate data loss risks during system crashes.
2. Staged Rollouts: Implement phased rollouts for updates to critical software like CrowdStrike, allowing time to identify and address issues before widespread deployment.
3. System Compatibility Checks: Regularly perform compatibility tests for new software updates with existing system configurations to prevent conflicts.
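The staged rollout idea can be sketched in a few lines of C++. This is a simplified illustration under stated assumptions, not how CrowdStrike or any specific vendor deploys updates: the Ring struct, the staged_rollout function, and the crash-rate threshold are all hypothetical, and a real system would read crash counts from telemetry rather than receive them as input.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical deployment ring: a named group of machines and how many of
// them crashed after receiving the update.
struct Ring {
    std::string name;
    std::size_t machines;
    std::size_t crashed;
};

// Deploys ring by ring and returns how many rings received the update.
// The rollout halts after the first ring whose crash rate exceeds
// max_crash_rate, so later rings never get the update.
std::size_t staged_rollout(const std::vector<Ring>& rings, double max_crash_rate) {
    std::size_t deployed = 0;
    for (const Ring& ring : rings) {
        ++deployed;  // this ring receives the update
        if (ring.machines == 0) {
            continue;  // nothing to measure in an empty ring
        }
        const double rate = static_cast<double>(ring.crashed) /
                            static_cast<double>(ring.machines);
        if (rate > max_crash_rate) {
            break;  // halt the rollout here
        }
    }
    return deployed;
}
```

For example, with a canary ring (100 machines, 0 crashes), an early ring (1,000 machines, 900 crashes), and a final ring covering everyone else, a 1% threshold makes staged_rollout return 2: the bad update stops at the second ring and never reaches the rest of the fleet.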
Implications and Lessons
This incident underscores the critical nature of kernel-level software and the importance of thorough testing and robust error handling. It also highlights the need for effective crisis management strategies. Organizations must balance the need for quick updates with the importance of safety and reliability.
Moving Forward
The CrowdStrike outage serves as a wake-up call for the IT community to re-evaluate their internal processes. Ensuring rigorous testing, gradual rollout of updates, and robust error handling are crucial. Additionally, organizations should strengthen their resilience and adaptability to handle such crises effectively.
Final Thoughts
While CrowdStrike has been a trusted name in cybersecurity, this incident is a reminder of the complexities and risks involved in the field. It is an opportunity for the entire industry to learn, improve, and reinforce their commitment to making the digital world a safer place. This event calls for a collective effort to enhance technical practices, resilience, and a continuous learning mindset in the face of challenges.
Thanks a lot for reading this article and for your valuable time. I’d love to hear your thoughts! If you liked this article, don’t forget to share it with your colleagues and friends. Let’s learn from each other and build a supportive community!
Stay Curious, Stay Driven, and Forge Your Path to Success!