Why Did the CrowdStrike Incident Happen, and Where Do We Go From Here?

On the morning of Friday, July 19th, 2024, CrowdStrike pushed an update to its Falcon Sensor software that caused 8.5 million Microsoft Windows machines to crash. The software is essentially antivirus, anti-malware, and intrusion detection rolled into one, and a key part of it runs inside the Windows kernel.

I wanted to investigate why this happened and how it affected airlines, hospitals, hotels, TV stations, and banks. People started seeing the Blue Screen of Death (BSOD) early that morning, and hundreds of pictures from airports and other places showed the error on screens worldwide. Five days later, some places are still struggling to recover. The only way to fix an affected machine is to get hands-on access to it and either boot into safe mode or boot from a USB key provided by Microsoft to remove the faulty code. Even with virtual machines (VMs), you need to mount the recovery media on each machine one by one, although that is still easier than locating physical machines that could be thousands of miles away.

One of the first issues is that CrowdStrike didn't seem to test this update thoroughly, or at all. It affected their entire Windows customer base. If they had tested it, they would have found that the update bricks Windows 10 machines and causes the BSOD. How this got through remains a mystery, though we might get some partial answers. Most software companies test their updates before releasing them to customers to avoid complaints; CrowdStrike apparently did not.

Another major issue is the lack of a canary deployment. Instead of pushing the update to a few customers first, waiting for success, and then proceeding to the next batch, CrowdStrike did a global push. Every Windows customer running the Falcon Sensor had their machines freeze at the BSOD. Why would they do this? Testing and staged rollouts both seem like common sense. We might discover the reasons, which are likely related to corporate culture and the "You Only Live Once" (YOLO) mentality.
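To make the canary idea concrete, here is a minimal Python sketch of a staged rollout. It is not how CrowdStrike's pipeline works. The function names (`plan_waves`, `staged_rollout`), the wave sizes of 1%, 10%, and 100%, and the `deploy` and `is_healthy` callbacks are all hypothetical stand-ins for whatever a real update system would use.

```python
from typing import Callable, Dict, Iterable, List, Sequence


def plan_waves(hosts: Sequence[str], fractions: Iterable[float]) -> List[List[str]]:
    """Split hosts into waves; each fraction is the cumulative share updated so far."""
    waves: List[List[str]] = []
    start = 0
    for fraction in fractions:
        # Every wave gets at least one host, and never runs past the end of the list.
        end = min(len(hosts), max(start + 1, round(len(hosts) * fraction)))
        if end > start:
            waves.append(list(hosts[start:end]))
        start = end
    return waves


def staged_rollout(
    hosts: Sequence[str],
    fractions: Iterable[float],
    deploy: Callable[[str], None],       # hypothetical: pushes the update to one host
    is_healthy: Callable[[str], bool],   # hypothetical: did the host come back up?
    max_failure_rate: float = 0.0,
) -> Dict[str, object]:
    """Deploy wave by wave and stop as soon as a wave exceeds the failure budget."""
    updated: List[str] = []
    for wave in plan_waves(hosts, fractions):
        for host in wave:
            deploy(host)
        updated.extend(wave)
        failed = [host for host in wave if not is_healthy(host)]
        if len(failed) / len(wave) > max_failure_rate:
            # Halt here: the remaining hosts never receive the bad update.
            return {"status": "halted", "updated": updated, "failed": failed}
    return {"status": "complete", "updated": updated, "failed": []}


if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(100)]
    # Simulate an update that crashes every machine it touches.
    result = staged_rollout(fleet, (0.01, 0.10, 1.0), lambda host: None, lambda host: False)
    print(result["status"], len(result["updated"]))
```

With a simulated fleet of 100 hosts and an update that crashes everything it touches, the rollout halts after the first wave, so one machine is affected instead of all 100. The design choice that matters is the health check between waves: without it, a staged rollout is just a slower global push.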

There are a few more issues to address before we move on to the causes and customer impacts. CrowdStrike marked their driver as a boot-start driver, meaning that if it's missing, the machine won't boot. This seems like corporate malfeasance, almost like a virus or malware itself.

Red Hat, the top enterprise Linux company, warned about this in June: "Disabling the CrowdStrike Falcon Sensor/Agent software suite will mitigate the crashes and provide temporary stability to the system in question while the issue is investigated." The issue was observed in releases 6 and 7. So someone at CrowdStrike knew their scanner was causing kernel panics in Linux, yet they proceeded with the update for their larger Windows market. Again, deployment by YOLO. Also, deploying on a Friday is a known risk among seasoned DevOps engineers and system administrators, as it can ruin your weekend.

Additionally, we haven't discussed why Microsoft allows these scanners to run in ring 0 (kernel mode), the most trusted part of an operating system. This access is partly due to legal issues, as regulators in the EU have forced Microsoft to open its kernel to these vendors. Microsoft owns 40% of the endpoint protection market, with CrowdStrike second at 14%. Technically, CrowdStrike's Falcon endpoint protection, anti-malware, antivirus, and intrusion detection scanner needs to operate in ring 0.

Elon Musk recently tweeted that "Boeing has too many nontechnical managers." CrowdStrike's CEO, George Kurtz, was once the CTO of McAfee, which in 2010 pushed out an update that deleted a key Windows file, causing millions of machines to crash. Sound familiar? A year after that debacle, he founded CrowdStrike. He's an accountant by training, worked for PricewaterhouseCoopers, and wrote a notoriously bad book called Hacking Exposed. But does he really understand ring 0, cybersecurity, or the difference between a bit and a byte? He's been a manager for 25+ years and knows how to raise money and make a lot of money!

CrowdStrike's Chief Security Officer spent 25 years in the FBI and has held that role at CrowdStrike for the last 12 years. Do you think he understands ring 0 or operating systems? The problem is that fewer and fewer people at the management level understand how these systems work. Most of what they peddle is marketing speak without real technical definitions behind it. Carl Sagan once said, "We've arranged a global civilization in which most crucial elements...profoundly depend on science and technology. We have also arranged things so that almost no one understands science and technology. This is a prescription for disaster."

Now on to the customers. They were hit by something CrowdStrike never told them about. They can't simply turn off these "channel files" (virus signature files). CrowdStrike and the entire cybersecurity industry have used fear-mongering, insisting on these updates because of "zero-day exploits," but the cure may be worse than the disease. CrowdStrike is indicative of the IT world in general. Most CTOs are managers and C-suite executives who aren't technical. They go golfing and get corporate seats from these vendors, thinking they are protected. Another issue is that many customers, like airlines, have offshored most of their IT work. So when someone had to be physically present at each keyboard, there weren't enough hands to fix the problem.

As an industry, we need to police ourselves better and not let cybersecurity firms run the show. CrowdStrike wouldn't be necessary if individual PCs had their Windows firewall turned on and Windows Defender running, and if corporate firewalls were detecting and denying bad actors. Modern firewalls do a great job. We also need more people entering our industry. Where are all the young kids in IT anymore? We need to offer training and incentives for kids to become the person in front of the keyboard, and stop offshoring all our work.





Thanks for sharing! So how many man-hours would it take to restore all of #Delta's computers & routers? All airports? All 911 emergency services in all 50 states?

Junaid Abro

Web Designer & Content Writer | WordPress Developer | SEO Specialist | Exploring Back-End Technologies

7 months ago
Moh Salah

Drupal PHP Developer

7 months ago

Mark Scheck, great investigation and insight. I can't imagine such a big deployment rollout not catching this bug. I wrote earlier about how this could be avoided by using Google Cloud's revision management, gradual rollouts, and traffic splitting features. DevOps/SRE engineers could have pointed a fraction of the production traffic to the updated nodes and, upon seeing failures, backed out with a single click (reverting to the last working version). #googlecloud https://cloud.google.com/run/docs/rollouts-rollbacks-traffic-migration https://www.dhirubhai.net/posts/activity-7222281517241122819-N6ff?utm_source=share&utm_medium=member_desktop

The #Crowdstrike #glitch was caused by a single line of code with a #memory #overflow #bug (out of a package with millions of lines of code!), according to US sources! #Microsoft has issued a #patch to fix it, but it requires an IT professional to apply it, so smaller companies without an IT department are hard hit, and #Delta #airlines has so many terminals that it is lagging behind in rolling out the fix!

Jeremy Page

Digital Forensics/Incident Response | DFIR | Digital Investigations | Mobile Forensics

7 months ago

Read-Only Fridays should become a standard for all IT departments.
