Lessons to Take Away from the CrowdStrike Crisis
Thank you for clicking on our newsletter, Arbisoft Next. Before we dive into the topic, if you haven't already subscribed, please do so to stay updated on the latest tech and Arbisoft news.
If you're interested in partnering with us, contact us here. Our team of over 900 members across five global offices specializes in Artificial Intelligence, Traveltech, and Edtech. Our partner platforms serve millions of users daily.
We’re always excited to connect with people who are changing the world. Get in touch!
From handwritten boarding passes to cash-only transactions at grocery stores, last week the world got a firsthand lesson in just how vulnerable our IT infrastructure is.
The recent global IT outage caused by a faulty CrowdStrike software update is a grim reminder of the dangers of over-reliance on a single system and of the importance of robust safeguards. On the morning of July 19, a security update that CrowdStrike pushed to its Falcon Sensor caused the third-party software to crash, taking almost 8.5 million Windows devices down with it. Considering that over half of Fortune 500 companies and U.S. government agencies use CrowdStrike as their primary cybersecurity provider, the resulting outage disrupted everything from banking and finance to travel, healthcare, and commerce worldwide.
But what should we learn from this incident? Let’s take a look at 7 key takeaways.
1. Not All Clouds Are Created Equal
Cloud reliance is the new normal, but it's crucial to understand the risks that come with it. Relying on a single cloud provider creates a monoculture, which is a lot like putting all your eggs in one basket. If that provider has a hiccup, like what happened recently, it can cripple the entire network! Businesses might want to consider a multi-cloud strategy. That way, if one cloud suffers a technical malfunction, the rest can keep things running smoothly, which matters most when a provider's reach is gigantic.
Mark Boost, CEO of Civo, emphasizes the dangers of monoculture:
"The outage highlights the over-reliance risk on a single system or provider. Even established giants aren't invincible."
2. The Code Flaws
The exact cause of the problematic CrowdStrike update remains under debate (kind of!). One theory points to a null pointer error, a common C++ coding bug where a pointer is dereferenced before it has been assigned a valid memory location. CrowdStrike denies this, and security researchers like Tavis Ormandy (Google) and Patrick Wardle (Objective-See) suspect a logic error instead. Regardless of the specifics, the faulty code should never have reached production.
3. QA is Necessary!
CrowdStrike's quality assurance (QA) team is under scrutiny for letting this update slip through the cracks. This raises the question of how such a critical security patch bypassed client-side controls and rolled out to everyone. Konstantin Klyagin of Redwerk and QAwerk highlights the importance of automated testing, especially for large-scale updates, where manual testing might miss crucial issues.
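As a rough illustration of the kind of automated check described above, here is a minimal sketch of a pre-release validator. Everything in it is hypothetical: the three-field rule format, `EXPECTED_FIELDS`, and `validate_update` are invented for the example and do not describe CrowdStrike's actual update format or pipeline.

```python
from dataclasses import dataclass

EXPECTED_FIELDS = 3  # hypothetical schema: name, pattern, action


@dataclass
class ValidationResult:
    ok: bool
    errors: list[str]


def validate_update(lines: list[str]) -> ValidationResult:
    """Check every rule line against the hypothetical schema before release."""
    errors = []
    if not lines:
        errors.append("update is empty")
    for number, line in enumerate(lines, start=1):
        fields = line.split(",")
        if len(fields) != EXPECTED_FIELDS:
            errors.append(
                f"line {number}: expected {EXPECTED_FIELDS} fields, got {len(fields)}"
            )
        elif any(not field.strip() for field in fields):
            errors.append(f"line {number}: empty field")
    return ValidationResult(ok=not errors, errors=errors)


good_update = ["rule-1,*.tmp,block", "rule-2,*.exe,scan"]
bad_update = ["rule-1,*.tmp,block", "rule-2,*.exe"]  # second line is missing a field

print(validate_update(good_update).ok)     # True
print(validate_update(bad_update).errors)  # ['line 2: expected 3 fields, got 2']
```

A check like this would run in the release pipeline, so a malformed update fails the build instead of reaching customer machines.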
4. Communication is Key in Chaos
The outage showcased the importance of clear and timely communication during a crisis. Businesses need to be prepared to inform stakeholders — employees, customers, and partners — about the situation, what's being done to fix it, and when normal operations are expected to resume. Regular updates, even if just to acknowledge ongoing investigations, help maintain trust and prevent confusion.
Both CrowdStrike and Microsoft demonstrated the importance of swift action in mitigating such crises. Their collaborative efforts to provide manual solutions and reroute traffic highlight the need for robust incident response plans.
5. Phased Rollouts Prevent Crisis
Pushing the update to every system at once, across so many organizations, is another critical lesson. While staged rollouts might seem time-consuming, they are essential for mission-critical systems. Techniques like blue/green deployments, canary deployments, and A/B testing allow for controlled rollouts, minimizing risk. Additionally, robust rollback procedures are crucial for reverting to a stable version if problems arise.
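To make the idea of a canary-style rollout with rollback concrete, here is a minimal sketch. The function `staged_rollout`, its parameters, and the 1/10/50/100 percent stages are assumptions chosen for illustration, not a description of any vendor's deployment system.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    stage_percents: Sequence[int],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    max_failure_rate: float = 0.05,
) -> tuple[bool, list[str]]:
    """Deploy in growing stages; roll everything back if too many hosts fail."""
    updated: list[str] = []
    if not hosts:
        return True, updated
    for percent in stage_percents:
        # Cumulative number of hosts that should be updated after this stage.
        target = max(1, len(hosts) * percent // 100)
        batch = hosts[len(updated):target]
        for host in batch:
            deploy(host)
        updated.extend(batch)
        failures = sum(1 for host in updated if not is_healthy(host))
        if failures / len(updated) > max_failure_rate:
            # Stop the rollout and revert every host touched so far.
            for host in updated:
                rollback(host)
            return False, updated
    return True, updated


fleet = [f"host-{i}" for i in range(100)]
rolled_back: list[str] = []
ok, touched = staged_rollout(
    fleet,
    stage_percents=(1, 10, 50, 100),
    deploy=lambda host: None,
    is_healthy=lambda host: False,  # simulate an update that crashes every host
    rollback=rolled_back.append,
)
print(ok, len(touched))  # False 1
```

In this simulation the bad update reaches one host out of a hundred before the rollout halts, which is the whole point of starting small.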
6. Test, Refine, Repeat
The importance of disaster recovery plans and reliable backups cannot be overstated. As cyber threats and technological complexities evolve, disaster recovery plans need to adapt. Turns out, businesses need ‘fire drills’ too, but for tech emergencies! Regularly practicing these "dry runs" by testing backups and recovery procedures ensures everything works smoothly when things go south. After all, a tech meltdown shouldn't turn into a full-blown crisis!
There are many cases where organizations struggled because they lacked rapid backup solutions, such as the Hollywood Presbyterian Medical Center ransomware attack in 2016 and the cyberattack on the city government of Riviera Beach, Florida. Cloud backups, while convenient, introduce complexity. Traditional disaster recovery methods and backups would have proven invaluable in this situation.
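A "fire drill" for backups can be as simple as restoring a copy into a scratch location and checking it against a checksum recorded when the backup was taken. The sketch below assumes a single-file backup; `make_backup`, `restore_drill`, and the file names are hypothetical, and the plain file copy stands in for whatever restore procedure an organization actually uses.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_backup(source: Path, backup: Path) -> str:
    """Copy the source file and return the checksum recorded at backup time."""
    shutil.copy2(source, backup)
    return sha256_of(source)


def restore_drill(backup: Path, expected_checksum: str, scratch_dir: Path) -> bool:
    """Restore into a scratch directory and verify the restored file's checksum."""
    restored = scratch_dir / backup.name
    shutil.copy2(backup, restored)  # stand-in for the real restore procedure
    return sha256_of(restored) == expected_checksum


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    source = root / "orders.db"
    source.write_bytes(b"critical business data")
    backup = root / "orders.db.bak"
    scratch = root / "scratch"
    scratch.mkdir()

    checksum = make_backup(source, backup)
    drill_passed = restore_drill(backup, checksum, scratch)

    backup.write_bytes(b"corrupted")  # simulate silent backup corruption
    corruption_detected = not restore_drill(backup, checksum, scratch)

print(drill_passed, corruption_detected)  # True True
```

Running a drill like this on a schedule surfaces a broken backup long before it is needed in an emergency.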
7. Monitoring and Response
The global reach of the outage emphasizes the need for advanced monitoring tools and well-defined incident response plans. Real-time monitoring can detect issues early on, while proper incident response plans ensure fast identification, isolation, and resolution. Continuous monitoring, root-cause analysis, and post-incident reviews are all crucial for building resilience.
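As one small example of what real-time monitoring can mean in practice, the sketch below tracks the crash rate over a sliding window of recent host check-ins and raises an alert once it crosses a threshold. The class `CrashRateMonitor`, the window size, and the 25% threshold are illustrative assumptions, not recommendations.

```python
from collections import deque


class CrashRateMonitor:
    """Alert when the crash rate over the most recent check-ins is too high."""

    def __init__(self, window_size: int = 100, threshold: float = 0.25,
                 min_samples: int = 5):
        self.events = deque(maxlen=window_size)  # oldest check-ins drop off automatically
        self.threshold = threshold
        self.min_samples = min_samples

    def crash_rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

    def record(self, crashed: bool) -> bool:
        """Record one check-in and return True if an alert should fire."""
        self.events.append(crashed)
        enough_data = len(self.events) >= self.min_samples
        return enough_data and self.crash_rate() > self.threshold


monitor = CrashRateMonitor(window_size=10, threshold=0.25, min_samples=5)
healthy_alerts = [monitor.record(False) for _ in range(10)]
crash_alerts = [monitor.record(True) for _ in range(5)]
print(crash_alerts)  # [False, False, True, True, True]
```

An alert like this would typically feed the incident response plan, for example by pausing an in-progress rollout and paging the on-call team.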
The CrowdStrike incident is a stark reminder that even routine maintenance can be disruptive if it is not managed and assessed properly. It highlights the interconnectedness of modern IT systems and the cascading effects of failures in widely used software. By implementing robust risk management strategies and learning from past events, IT teams can be better prepared to weather the next storm.