CrowdStrike Outage: A Lesson in Security Patching Gone Wrong
Theresa McFarlane
Project Management, Agile, Delivery Management Professional | Agile Coach | Sr. Scrum Master | Product Owner | Atlassian Admin SME, Community Leader and Creator | A.I., Data, and Cyber Security Enthusiast
The recent widespread IT outage caused by a faulty CrowdStrike update serves as a powerful reminder of the importance of well-defined security patching processes. While keeping systems updated is essential for cybersecurity, this incident underscores the critical need for robust testing, clear communication protocols, and best practices drawn from project management and agile methodologies. Let's delve deeper into the key takeaways and how they translate into more secure and efficient patching strategies.
This event also sparked a teachable moment, as many parents can attest. When children encounter complex events in the news, curiosity often follows. The following story, inspired by conversations with Rosie and Vicky, aims to explain the situation in a way that resonates with a younger audience as well as non-technical readers. Without further ado, The Very Agile PM Presents...
Now back to our regularly scheduled programming...What happened last week???!!!
The Culprit: A Faulty Falcon Update
The outage stemmed from a routine update for CrowdStrike's Falcon endpoint detection and response (EDR) software. Unfortunately, the update contained a bug that triggered a blue screen of death (BSOD) on affected Microsoft Windows devices. This critical error rendered countless machines inoperable, causing significant disruptions across various industries.
Impact Beyond Security
The fallout from the outage extended far beyond cybersecurity concerns. Airports around the globe grounded flights, financial transactions stalled, and even emergency services were hampered. This cascading effect underscores the critical role cybersecurity plays in modern infrastructure.
Lessons Learned
Several key takeaways can be gleaned from this incident:
Importance of Thorough Testing: Rigorous testing procedures are essential before deploying any security updates, especially those with system-wide implications.
Never Test in Production and Label Your Environments Clearly
Production environments, often denoted as prod or production in system labels, should be reserved for stable, reliable, and mission-critical systems. Testing updates in separate, controlled environments allows for the identification and resolution of issues before they impact critical systems.
These separate environments should use clear labels for easy identification, such as:
Development (dev): This environment is used for initial development and coding of the update.
Testing (test): This environment replicates a production-like setting for rigorous testing of the update before deployment.
Staging (stage): This environment serves as a final testing ground before pushing the update to production. It closely mirrors the production environment, allowing for a final check for compatibility and functionality.
Deploying an update directly to a production environment for testing significantly increases the risk of disrupting core functionalities and causing widespread outages.
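The environment order described above can be expressed as a simple promotion gate. This is a minimal sketch, not a real deployment tool: the names Environment, next_environment, and can_promote are hypothetical and exist only to illustrate the idea that a build reaches prod only after passing dev, test, and stage.

```python
from enum import IntEnum


class Environment(IntEnum):
    """Environment labels in promotion order (hypothetical names for illustration)."""
    DEV = 0
    TEST = 1
    STAGE = 2
    PROD = 3


def next_environment(current: Environment) -> Environment:
    """Return the only environment a build may be promoted to next."""
    if current is Environment.PROD:
        raise ValueError("prod is the final environment")
    return Environment(current + 1)


def can_promote(passed: set[Environment], target: Environment) -> bool:
    """A build may enter `target` only if it has passed every earlier environment."""
    required = {env for env in Environment if env < target}
    return required <= passed
```

With this gate, can_promote({Environment.DEV}, Environment.PROD) is False, because the build has skipped test and stage. The point of the design is that "testing in production" becomes something the process refuses rather than something people have to remember to avoid.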
Strategic Release Scheduling: Scheduling releases for times with lower operational impact is crucial. Fridays are often avoided for several reasons:
Reduced Staffing: Many IT teams have smaller support staff on weekends. If an issue arises with a Friday release, it can lead to delayed resolutions due to limited personnel available to fix the problem.
Limited Development Resources: Developers may not be readily available to address critical bugs or unexpected issues that emerge after a Friday deployment. This can leave organizations scrambling to find solutions over the weekend.
Change Aversion on Fridays: People are generally less receptive to major changes on the cusp of the weekend. A Friday release might disrupt workflows or confuse folks as employees are mentally preparing for downtime and might be checking out early.
Testing Window Concerns: Ideally, new releases undergo thorough testing before deployment. A Friday release limits the amount of time for this testing to occur during the workweek, potentially increasing the risk of bugs slipping through the cracks.
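A scheduling rule like this is easy to encode as a pre-release check. The sketch below is illustrative only: the function name is_allowed_release_day is hypothetical, and blocking Saturday and Sunday in addition to Friday is an assumption based on the staffing concerns above, not a universal rule.

```python
from datetime import date

# date.weekday() numbers Monday as 0 and Sunday as 6.
FRIDAY, SATURDAY, SUNDAY = 4, 5, 6

# Assumption: weekends are blocked along with Friday, since reduced staffing is the concern.
BLOCKED_WEEKDAYS = frozenset({FRIDAY, SATURDAY, SUNDAY})


def is_allowed_release_day(day: date, blocked=BLOCKED_WEEKDAYS) -> bool:
    """Return True if a release may be scheduled on this calendar day."""
    return day.weekday() not in blocked
```

For example, is_allowed_release_day(date(2024, 7, 19)) returns False because that date was a Friday, while date(2024, 7, 16), a Tuesday, is allowed. Teams with different support coverage can pass their own set of blocked weekdays.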
Clear Testing Plans and Communication with Confluence and Jira:
Well-defined testing plans documented in tools like Jira and Confluence are essential. To enhance transparency and collaboration, consider splitting the work between the two tools:
Confluence for Detailed Plans and Collaboration: Utilize Confluence to document well-defined testing plans. These plans should detail the scope of testing, including specific functionalities and potential risk areas. Additionally, Confluence allows for creating clear test cases outlining the steps to be followed and the expected outcomes. This fosters transparency and collaboration among development, testing, and security teams. Team members can easily access the plans, add comments, and discuss potential issues before deployment.
Jira for Tracking and Execution: Jira is a powerful tool for managing the testing process itself. Test cases created in Confluence can be linked to Jira issues, allowing for efficient tracking and execution. Testers can use Jira to record their progress, log any bugs encountered, and track the overall status of the testing cycle.
This integration provides a clear view of the testing progress and facilitates communication between teams. By combining the detailed planning capabilities of Confluence with the tracking and execution strengths of Jira, organizations can keep everyone informed and run a smooth, efficient testing process. This approach reduces the risk of unforeseen issues like the one experienced with the CrowdStrike update.
Rollback Plan: A well-defined rollback plan ensures a swift recovery in case of unforeseen issues. This plan should outline the steps to revert to the previous version of the software and minimize downtime.
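The rollback steps can be sketched in a few lines. This is a simplified illustration under stated assumptions: install and health_check are hypothetical callables that a team would supply for its own software, and a real plan would also cover data migrations and machines that cannot be reached remotely.

```python
from typing import Callable


def deploy_with_rollback(
    install: Callable[[str], None],
    health_check: Callable[[], bool],
    previous_version: str,
    new_version: str,
) -> str:
    """Install new_version and revert to previous_version if it is unhealthy.

    Returns the version that is left running.
    """
    try:
        install(new_version)
        if health_check():
            return new_version
    except Exception:
        # Assumption: a failed install is handled the same way as a failed health check.
        pass
    install(previous_version)
    return previous_version
```

The key design choice is that the previous version is named before the deployment starts, so reverting is a known step rather than an improvised one.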
Communication is Key: Clear and timely communication during outages minimizes confusion and allows organizations to take necessary mitigation steps. As a PM or Agile professional, when I know my team is about to do anything that will have an impact, I let the people affected know early. Even a short email to your customer support teams in advance helps: state that you will be doing a release on xx/xx/xxxx at 00:00 EST, that it will impact x, y, and z, what the release is, why you are doing it, and that anything unusual should be reported to XYZ. The same goes for any scheduled outages or slowdowns. I can't tell you how many times I have seen money wasted chasing ghosts because of a lack of communication during testing or release.
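One way to make sure an advance notice never omits one of those details is to generate it from a small template. The sketch below is only an illustration: the ReleaseNotice class and its field names are hypothetical, and the values a team fills in are entirely its own.

```python
from dataclasses import dataclass


@dataclass
class ReleaseNotice:
    """The details an advance release notice should carry (hypothetical structure)."""
    release_name: str
    scheduled_for: str      # date and time, including the time zone
    impacted: list[str]     # systems or teams that will feel the change
    reason: str
    report_to: str          # where to report anything unusual

    def render(self) -> str:
        """Return the notice as plain text, ready to paste into an email."""
        return "\n".join([
            f"Release: {self.release_name}",
            f"When: {self.scheduled_for}",
            f"Impacted: {', '.join(self.impacted)}",
            f"Why: {self.reason}",
            f"If you see anything unexpected, report it to: {self.report_to}",
        ])
```

Because every field is required, the notice cannot be created without stating when the release happens, what it affects, why it is happening, and who to contact.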
The CrowdStrike outage exposed an unexpected wrinkle: certain companies, like UPS, FedEx, and Southwest Airlines, reportedly remained unaffected due to their reliance on a much older operating system, Windows 3.1 (released in 1992). These reports circulated widely and were later disputed, so treat them with caution. If true, the news is initially concerning from a cybersecurity standpoint, since a 32-year-old OS lacks the support and security patches of modern systems, but it also sparks a fascinating question. Could the Y2K scare, which prompted a global focus on software compatibility and updates, have inadvertently had a silver lining? Perhaps the rigorous testing conducted back then unexpectedly protected these legacy systems from a vulnerability in the CrowdStrike update. This incident highlights the complex interplay between security updates, outdated systems, and the potential for unforeseen consequences.
One thing is clear, as I always said when I was responsible for the Product Security Champion program:
Security should never be an afterthought. IT/Security should be well funded in organizations, because an incident will cost companies 10x more in brand reputation, lost sales, and remediation.
Why the CrowdStrike Outage Matters to Project Managers and Agile Teams
The recent CrowdStrike outage serves as a cautionary tale for project managers and agile professionals, highlighting the critical role they play in ensuring secure and smooth deployments. Here's why understanding this incident is crucial:
By understanding these lessons from the CrowdStrike outage, project managers and agile professionals can become valuable assets in their organizations. They can champion secure practices, facilitate clear communication, and ensure a smooth and efficient patching process, minimizing the risk of outages and fostering trust with stakeholders.
Remember, by being proactive and organized, with checklists and clear communication channels, you can become a superstar for your team, preventing issues and ensuring successful deployments. This, again, is why Atlassian products are always my partners in crime for documentation. I can't tell you how many times using Jira and Confluence has prevented mishaps.
#atlassian #atlassiancreator #cloudstrike #projectmanagement #testing #QAT #releasemanagement #agile #scrum #jira #confluence