The CrowdStrike Windows Outage: A Detailed Case Study


The CrowdStrike outage has been making headlines everywhere. If you're curious about what happened and why it matters, you're in the right place. This post breaks down the incident, its implications, and the lessons we can all learn, even if you're not a tech expert. One caveat up front: this is a best-effort account based on my own experience and on information gathered from the community, simplified to keep the article accessible. Let's dive in!


The Incident

On July 19, 2024, around 12 PM IST, many organisations encountered an unexpected issue: the Blue Screen of Death (BSOD), which led to widespread disruptions. An update from the cybersecurity company CrowdStrike caused this mess, impacting critical systems in various sectors, including hospitals, airlines, and emergency services. When flights are delayed, hospital systems are down, and emergency services are unavailable, it's serious business: it affects people's lives.


Who's CrowdStrike?

CrowdStrike is a cybersecurity company that protects machines from cyber threats with a lightweight agent installed on Windows systems. It protects over 29,000 organizations, and its main service is EDR (Endpoint Detection and Response), which catches malware and ransomware. Because CrowdStrike's software integrates deeply into Windows, the fallout from this issue was massive: a fault in the agent took down the entire Windows system.

A Bug is Just a Bug, Until It's Not

First, let's explain why you saw the infamous Blue Screen of Death (BSOD) on screens around the world today.

BSOD - Blue Screen of Death

A BSOD signifies a critical error, often involving kernel-level operations, which have privileged access to system resources. Windows operates in three modes:

  • User Mode - Where most programs run, with limited access to system resources. The browser you are reading this in is most likely running in "user" space. If a process crashes there, the operating system can report the crash and you, as the user, can start it again. Hopefully it works and the issue doesn't recur, but either way you can still use your computer.
  • Kernel Mode - Where critical software runs, with direct hardware access. This mode includes system drivers and EDR software; when you hear "system drivers", they are most likely running here. Kernel Mode should be treated like Spider-Man's powers ("With great power comes great responsibility"). The number of major companies impacted around the world shows what is at stake when software running in Kernel Mode fails.
  • S Mode - This is a Windows exclusive and we don't really need to talk about it much in this context. Just know that it exists, is more locked down, and is meant for Microsoft Store apps.
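
To make the User Mode point concrete, here is a minimal sketch of crash containment. It starts a child process that deliberately aborts and shows that the parent keeps running afterwards. The function name `run_and_crash` is made up for this illustration, and the exact exit code differs between operating systems.

```python
import subprocess
import sys

def run_and_crash() -> int:
    """Start a child Python process that aborts, and return its exit code."""
    child_code = "import os; os.abort()"  # the child kills itself abnormally
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode

if __name__ == "__main__":
    code = run_and_crash()
    # Only the child died; this line still runs in the parent process.
    print(f"child exited abnormally with code {code}; parent is still running")
```

A kernel-mode fault has no such safety net: there is no outer process left to catch the failure, so Windows stops the whole machine.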

A kernel-mode crash usually causes a BSOD. EDR software like CrowdStrike's needs kernel access to monitor system events and to stop actual malware from taking action on systems. Because CrowdStrike's kernel driver is a required boot-start driver, the error it encountered produced the BSOD you saw in the news headlines, and affected machines hit it again on every boot.


Ok, but what happened?

CrowdStrike pushed what they called a "Channel File Update" to all customer systems. This file (C-00000291-00000000-00000032.sys) bypasses customer-configured Sensor Update Policies and is a background update to the core components of all installed agents. Usually these files are updated without issue, and no action is ever needed by the user.

However, this update introduced a null pointer error once the kernel driver (CSagent.sys) tried to load the file. I won't go into pointers and C++ memory management here, but understand that low-level languages let you do things that are unsafe and that cause crashes if you don't write checks into your code. CrowdStrike's driver is written in C++, where failing to check for NULL pointers can cause crashes. This error led to a memory access violation, forcing Windows to crash the entire system.

A stack trace from one of that day's crashes shows the error "Access Violation", indicating a problem accessing some memory. It also shows that the read address is 0x9c and that the faulting instruction is a move (mov), which is Assembly for "copy data from here to there". Unfortunately, the memory location 0x9c is not accessible because it lies in an invalid region of memory. When kernel-mode code tries to read an address like this, Windows crashes the whole system.
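
Why would a read land on an address as small as 0x9c? One common explanation is that code reads a field through a pointer that is NULL (address 0), so the address actually touched is just the field's offset. The sketch below illustrates the idea with Python's `ctypes`. The `SensorRecord` layout and the `flags` field are hypothetical, chosen only so that the field sits at offset 0x9c; they are not CrowdStrike's real data structures.

```python
import ctypes

class SensorRecord(ctypes.Structure):
    # Hypothetical layout (an assumption for illustration):
    # 0x9c bytes of other data, followed by a 32-bit field.
    _fields_ = [
        ("other_fields", ctypes.c_uint8 * 0x9C),
        ("flags", ctypes.c_uint32),
    ]

def field_address(base_address: int) -> int:
    """Address that a read of `flags` touches for a record at base_address."""
    return base_address + SensorRecord.flags.offset

def read_flags(record_ptr):
    """Return the flags field, or None when the pointer is NULL."""
    if not record_ptr:  # ctypes NULL pointers are falsy
        return None
    return record_ptr.contents.flags

if __name__ == "__main__":
    print(hex(field_address(0)))  # 0x9c: a NULL base plus the field offset
    null_ptr = ctypes.POINTER(SensorRecord)()
    print(read_flags(null_ptr))   # None: the check avoids the bad read
```

The `if not record_ptr` line is the kind of check the paragraph above has in mind: one guard before the dereference turns a crash into an error the caller can handle.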


Resolving the CrowdStrike BSOD Bug

To recover your machine:

  1. Access the Windows Recovery Environment

  • Tap the F8 key repeatedly during startup until you see the Recovery screen
  • Navigate: Troubleshoot -> Advanced options, then click Command Prompt in the Advanced options menu

  • In the Command Prompt window, type the commands needed to remove the faulty file, as shown below

To delete:
del C:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys
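
Before running a wildcard delete, it helps to know exactly which file names the pattern will catch. The following is a rough sketch that approximates the `C-00000291*.sys` wildcard with Python's `fnmatch`. Apart from the channel file named in this article, the file names in the example are made up, and Windows wildcard matching has a few edge cases this does not model.

```python
import fnmatch

PATTERN = "C-00000291*.sys"

def files_to_delete(file_names):
    """Return the names the wildcard would match, ignoring case as Windows does."""
    pattern = PATTERN.lower()
    return [n for n in file_names if fnmatch.fnmatchcase(n.lower(), pattern)]

if __name__ == "__main__":
    names = [
        "C-00000291-00000000-00000032.sys",  # the faulty channel file
        "C-00000290-00000000-00000001.sys",  # hypothetical neighbouring file
        "CSagent.sys",                       # the driver itself, must stay
    ]
    print(files_to_delete(names))
```

Only the first name matches, which is the point of the narrow pattern: it removes the faulty channel file and leaves the driver and other channel files alone.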

To disable:
@echo off
setlocal
REM Define the driver file pattern
set "driver_pattern=C-00000291*.sys"
REM Define the target directory
set "target_dir=C:\Windows\System32\drivers\CrowdStrike"
REM Change to the target directory
cd /d "%target_dir%" || (
 echo Failed to change directory to %target_dir%
 goto :error
)
REM Find the driver file
for %%f in (%driver_pattern%) do (
 set "driver_file=%%f"
 goto :found
)
echo No driver file matching %driver_pattern% found.
goto :error
:found
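REM Extract the base name of the driver file. This assumes the file name without
REM its extension matches a registered service name. Channel files are data files,
REM not services, so expect sc config to fail unless that assumption holds.
set "driver_name=%driver_file:~0,-4%"
REM Disable the driver
sc config "%driver_name%" start= disabled || (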
 echo Failed to disable the driver %driver_name%
 goto :error
)
echo Successfully disabled the driver %driver_name%
REM Reboot the system
shutdown /r /t 0
goto :eof
:error
echo An error occurred. Exiting without reboot.
endlocal
pause
exit /b 1

Other resolution steps include:

1. Uninstall or Disable CrowdStrike

  • Safe Mode Uninstall: In Safe Mode, navigate to Control Panel > Programs > Programs and Features, find CrowdStrike Falcon, and uninstall it.
  • Disable Sensor: If uninstalling is not feasible, disable the CrowdStrike Falcon sensor from the CrowdStrike management console.

2. Update Drivers and System Files

  • Driver Update: Ensure all system drivers are up-to-date. This can often resolve conflicts causing BSODs.
  • Windows Update: Run Windows Update to install the latest patches and updates from Microsoft.

3. Reinstall CrowdStrike

  • Updated Version: Download the latest version of CrowdStrike Falcon from the official website or management console.
  • Installation: Reinstall the updated version on the affected devices.

4. Test and Monitor

  • Initial Testing: After reinstallation, perform a series of tests to ensure the system operates without issues.
  • Continuous Monitoring: Use CrowdStrike’s monitoring tools to keep an eye on system performance and quickly identify any future issues.

Preventive Measures for the Future

1. Regular Backups: Back up critical data regularly to mitigate the risk of data loss during system crashes.

2. Staged Rollouts: Implement phased rollouts for updates to critical software like CrowdStrike, allowing time to identify and address issues before widespread deployment.

3. System Compatibility Checks: Regularly test new software updates against existing system configurations to prevent conflicts.
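
As an illustration of the staged-rollout idea, here is a minimal sketch of one way to decide which machines receive an update at each stage. Each host is hashed into one of 100 stable buckets, and a stage at N percent includes the hosts in buckets below N. The function names and host names are hypothetical, and this is not a description of how CrowdStrike or any other vendor actually schedules updates.

```python
import hashlib

def rollout_bucket(host_id: str, buckets: int = 100) -> int:
    """Map a host to a stable bucket in [0, buckets) using SHA-256."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets

def in_rollout(host_id: str, percent: int) -> bool:
    """True if the host belongs to the first `percent` percent of buckets."""
    return rollout_bucket(host_id) < percent

def plan_stage(hosts, percent):
    """Hosts that should receive the update at this stage."""
    return [h for h in hosts if in_rollout(h, percent)]

if __name__ == "__main__":
    hosts = [f"host-{i:03d}" for i in range(200)]  # made-up host names
    for percent in (1, 10, 50, 100):
        print(percent, len(plan_stage(hosts, percent)))
```

Because the bucket depends only on the host name, every host in the 10 percent stage is also in the 50 percent stage. A bad update caught early therefore stays confined to the small first group.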


Implications and Lessons

This incident underscores the critical nature of kernel-level software and the importance of thorough testing and robust error handling. It also highlights the need for effective crisis management strategies. Organizations must balance the need for quick updates with the importance of safety and reliability.
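
To show what "robust error handling" can mean in practice, here is a small sketch of a loader that validates an update before using it and falls back to the last known good rules when anything looks wrong. The three-field, comma-separated record format and the `Rule` type are invented for this example and do not reflect the real channel file format.

```python
from dataclasses import dataclass

EXPECTED_FIELDS = 3  # hypothetical record width: name, pattern, action

@dataclass
class Rule:
    name: str
    pattern: str
    action: str

def parse_rule(line: str):
    """Return a Rule, or None if the line is malformed."""
    fields = line.strip().split(",")
    if len(fields) != EXPECTED_FIELDS or not all(fields):
        return None
    return Rule(*fields)

def load_rules(lines, last_known_good):
    """Accept the update only if every line parses; otherwise keep the old rules."""
    parsed = [parse_rule(line) for line in lines]
    if not parsed or any(rule is None for rule in parsed):
        return last_known_good
    return parsed

if __name__ == "__main__":
    good = [Rule("baseline", "*.exe", "scan")]
    print(load_rules(["ransom,*.locked,block"], good))  # update accepted
    print(load_rules(["broken,line"], good))            # falls back to good
```

The design choice here is to fail closed: a malformed update is rejected as a whole and the system keeps running on what it already had, rather than acting on half-valid data.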


Moving Forward

The CrowdStrike outage serves as a wake-up call for the IT community to re-evaluate its internal processes. Rigorous testing, gradual rollout of updates, and robust error handling are crucial. Additionally, organizations should strengthen their resilience and adaptability so they can handle such crises effectively.


Final Thoughts

While CrowdStrike has been a trusted name in cybersecurity, this incident is a reminder of the complexities and risks involved in the field. It is an opportunity for the entire industry to learn, improve, and reinforce their commitment to making the digital world a safer place. This event calls for a collective effort to enhance technical practices, resilience, and a continuous learning mindset in the face of challenges.


Thanks a lot for reading this article and for your valuable time. I'd love to hear your thoughts! If you liked this article, don't forget to share it with your colleagues and friends. Let's learn from each other and build a supportive community!

Stay Curious, Stay Driven, and Forge Your Path to Success!

Follow me on Medium & LinkedIn


