CrowdStrike Outage: A Lesson in Security Patching Gone Wrong

Theresa McFarlane

Project Management, Agile, Delivery Management Professional | Agile Coach | Sr. Scrum Master | Product Owner | Atlassian Admin SME, Community Leader and Creator | A.I., Data, and Cyber Security Enthusiast

Published: July 23, 2024

The recent widespread IT outage caused by a faulty CrowdStrike update serves as a powerful reminder of the importance of well-defined security patching processes. While keeping systems updated is essential for cybersecurity, this incident underscores the critical need for robust testing, clear communication protocols, and best practices drawn from project management and agile methodologies. Let's delve deeper into the key takeaways and how they translate into more secure and efficient patching strategies.

This event also sparked a teachable moment, as many parents can attest. When children encounter complex events in the news, curiosity often follows. The following story, inspired by conversations with Rosie and Vicky, aims to explain the situation in a way that resonates with a younger audience as well as non-technical readers. Without further ado, The Very Agile PM Presents...

The Adventures of Falcon and Billy - Page 2

Now back to our regularly scheduled programming... What happened last week?!

The Culprit: A Faulty Falcon Update

The outage stemmed from a routine update for CrowdStrike's Falcon endpoint detection and response (EDR) software. Unfortunately, the update contained a bug that triggered a blue screen of death (BSOD) on affected Microsoft Windows devices. This critical error rendered an estimated 8.5 million machines inoperable, by Microsoft's count, causing significant disruptions across various industries.

Impact Beyond Security

The fallout from the outage extended far beyond cybersecurity concerns. Airports around the globe grounded flights, financial transactions stalled, and even emergency services were hampered. This cascading effect underscores the critical role cybersecurity plays in modern infrastructure.

Lessons Learned

Several key takeaways can be gleaned from this incident:

Importance of Thorough Testing: Rigorous testing procedures are essential before deploying any security updates, especially those with system-wide implications.

Never Test in Production and Label Your Environments Clearly

Production environments, often denoted as prod or production in system labels, should be reserved for stable, reliable, and mission-critical systems. Testing updates in separate, controlled environments allows for the identification and resolution of issues before they impact critical systems.

These separate environments should use clear labels for easy identification, such as:

Development (dev): This environment is used for initial development and coding of the update.

Testing (test): This environment replicates a production-like setting for rigorous testing of the update before deployment.

Staging (stage): This environment serves as a final testing ground before pushing the update to production. It closely mirrors the production environment, allowing for a final check for compatibility and functionality.

Deploying an update directly to a production environment for testing significantly increases the risk of disrupting core functionalities and causing widespread outages.
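As a rough illustration of the environment labels above, the sketch below allows a deploy only when every earlier environment has already passed. The Release class and the can_deploy and record_pass functions are hypothetical names invented for this example, not part of any real deployment tool; only the dev, test, stage, and prod labels come from the text.

```python
from dataclasses import dataclass, field

# Environment labels from the article, in promotion order.
PROMOTION_ORDER = ["dev", "test", "stage", "prod"]


@dataclass
class Release:
    """Hypothetical record of which environments an update has passed."""
    version: str
    passed: set = field(default_factory=set)


def can_deploy(release: Release, target: str) -> bool:
    """Allow a deploy only if every earlier environment has passed."""
    if target not in PROMOTION_ORDER:
        raise ValueError(f"Unknown environment label: {target!r}")
    required = PROMOTION_ORDER[:PROMOTION_ORDER.index(target)]
    return all(env in release.passed for env in required)


def record_pass(release: Release, env: str) -> None:
    """Mark an environment as passed, refusing to skip a stage."""
    if not can_deploy(release, env):
        raise RuntimeError(f"{release.version} cannot skip ahead to {env}")
    release.passed.add(env)


if __name__ == "__main__":
    update = Release(version="1.2.3")
    record_pass(update, "dev")
    record_pass(update, "test")
    print(can_deploy(update, "stage"))  # dev and test have passed
    print(can_deploy(update, "prod"))   # stage has not passed yet
```

Keeping the labels in one ordered list means a mislabeled or skipped environment fails loudly instead of silently reaching production.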

Strategic Release Scheduling: Scheduling releases for times with lower operational impact is crucial. Fridays are often avoided for several reasons:

Reduced Staffing: Many IT teams have smaller support staff on weekends. If an issue arises with a Friday release, it can lead to delayed resolutions due to limited personnel available to fix the problem.

Limited Development Resources: Developers may not be readily available to address critical bugs or unexpected issues that emerge after a Friday deployment. This can leave organizations scrambling to find solutions over the weekend.

Change Aversion on Fridays: People are generally less receptive to major changes on the cusp of the weekend. A Friday release might disrupt workflows or cause confusion, since employees are mentally preparing for downtime and might be checking out early.

Testing Window Concerns: Ideally, new releases undergo thorough testing before deployment. A Friday release limits the amount of time for this testing to occur during the workweek, potentially increasing the risk of bugs slipping through the cracks.
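The Friday guidance above can be turned into a simple pre-release check. This is a minimal sketch: the function names are hypothetical, and blocking Saturday and Sunday along with Friday is an assumption made here because of the reduced weekend staffing mentioned above. Adjust the blocked days to match your own team's policy.

```python
from datetime import datetime, timedelta

# Assumption: Friday, Saturday, and Sunday are blocked.
# datetime.weekday() returns Monday=0 ... Sunday=6.
BLOCKED_WEEKDAYS = {4, 5, 6}


def is_release_allowed(when: datetime) -> bool:
    """Return True if the proposed release day is not a blocked day."""
    return when.weekday() not in BLOCKED_WEEKDAYS


def next_allowed_day(when: datetime) -> datetime:
    """Move a proposed release forward to the next allowed day."""
    candidate = when
    while not is_release_allowed(candidate):
        candidate += timedelta(days=1)
    return candidate


if __name__ == "__main__":
    proposed = datetime(2024, 7, 19)  # a Friday
    print(is_release_allowed(proposed))
    print(next_allowed_day(proposed).strftime("%A %Y-%m-%d"))
```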

Clear Testing Plans and Communication with Confluence and Jira:

Well-defined testing plans documented in tools like Jira/Confluence are essential. To enhance transparency and collaboration, consider this approach:

Confluence for Detailed Plans and Collaboration: Utilize Confluence to document well-defined testing plans. These plans should detail the scope of testing, including specific functionalities and potential risk areas. Additionally, Confluence allows for creating clear test cases outlining the steps to be followed and the expected outcomes. This fosters transparency and collaboration among development, testing, and security teams. Team members can easily access the plans, add comments, and discuss potential issues before deployment.

Jira for Tracking and Execution: Jira is a powerful tool for managing the testing process itself. Test cases created in Confluence can be linked to Jira issues, allowing for efficient tracking and execution. Testers can use Jira to record their progress, log any bugs encountered, and track the overall status of the testing cycle.

This integration provides a clear view of the testing progress and facilitates communication between teams. By combining the detailed planning capabilities of Confluence with the tracking and execution strengths of Jira, organizations can achieve clear communication and ensure a smooth and efficient testing process. This approach minimizes the risk of unforeseen issues like the one experienced with the CrowdStrike update.

Rollback Plan: A well-defined rollback plan ensures a swift recovery in case of unforeseen issues. This plan should outline the steps to revert to the previous version of the software and minimize downtime.
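One way to picture a rollback plan is as a deploy step that reverts automatically when a post-deploy check fails. The sketch below assumes two caller-supplied functions, install and is_healthy, which are hypothetical stand-ins for whatever your own tooling provides; it is not a description of how any particular vendor handles rollbacks.

```python
from typing import Callable


def deploy_with_rollback(
    previous: str,
    candidate: str,
    install: Callable[[str], None],
    is_healthy: Callable[[], bool],
) -> str:
    """Install the candidate; reinstall the previous version on failure.

    Returns the version left running.
    """
    try:
        install(candidate)
        healthy = is_healthy()
    except Exception:
        # Treat any failure during install or checking as unhealthy.
        healthy = False
    if healthy:
        return candidate
    install(previous)  # The revert step from the rollback plan.
    return previous


if __name__ == "__main__":
    installed = []
    running = deploy_with_rollback(
        previous="1.0",
        candidate="1.1",
        install=installed.append,
        is_healthy=lambda: False,  # Simulate a failed post-deploy check.
    )
    print(running, installed)
```

The point of writing the revert step down, even this simply, is that nobody has to improvise it during an outage.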

Communication is Key: Clear and timely communication during outages minimizes confusion and allows organizations to take necessary mitigation steps. As a PM or Agile professional, when I know my team is going to do anything that will have an impact, I let those affected know early on. Even a simple advance notice to your customer support teams, by email or another channel, helps: we will be doing a release on xx/xx/xxxx at 00:00 EST, and it will impact x, y, and z; here is what the release is and why we are doing it; if you see anything, report it to XYZ. The same goes for any scheduled outages or slowdowns. I can't tell you how many times I have seen money wasted chasing ghosts due to a lack of communication during testing or release.
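The advance notice described above has a fixed set of fields, so it is easy to template. The following is only a sketch: the ReleaseNotice class and its field names are hypothetical, and the placeholder values mirror the ones used in the paragraph above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReleaseNotice:
    """Hypothetical container for the fields an advance notice should cover."""
    what: str
    why: str
    when: str
    impacted: List[str]
    report_to: str

    def render(self) -> str:
        """Build a plain-text notice suitable for email or chat."""
        lines = [
            f"Scheduled release: {self.when}",
            f"What is changing: {self.what}",
            f"Why we are doing this: {self.why}",
            "Impacted: " + ", ".join(self.impacted),
            f"If you see anything unusual, report it to {self.report_to}.",
        ]
        return "\n".join(lines)


if __name__ == "__main__":
    notice = ReleaseNotice(
        what="Placeholder description of the release",
        why="Placeholder reason for the release",
        when="xx/xx/xxxx at 00:00 EST",
        impacted=["x", "y", "z"],
        report_to="XYZ",
    )
    print(notice.render())
```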

The CrowdStrike outage exposed an unexpected wrinkle: certain companies, like UPS, FedEx, and Southwest Airlines, reportedly remained unaffected due to their reliance on a much older operating system, Windows 3.1 (released in 1992). While this news is initially concerning from a cybersecurity standpoint, since a 32-year-old OS lacks the support and security patches of modern systems, it also sparks a fascinating question. Could the Y2K scare, which prompted a global focus on software compatibility and updates, have inadvertently had a silver lining? Perhaps the rigorous testing conducted back then unexpectedly protected these legacy systems from the bug in the CrowdStrike update. This incident highlights the complex interplay between security updates, outdated systems, and the potential for unforeseen consequences.

When I thought about it, I envisioned and created this meme

One thing is clear, as I always said when I was responsible for the Product Security Champion program:

Security should never be an afterthought. IT/Security should be well-funded in organizations, because neglecting it can cost companies 10x more in brand reputation, lost sales, and remediation.

Why the CrowdStrike Outage Matters to Project Managers and Agile Teams

The recent CrowdStrike outage serves as a cautionary tale for project managers and agile professionals, highlighting the critical role they play in ensuring secure and smooth deployments. Here's why understanding this incident is crucial:

Identifying Risks Early: Project managers and agile teams are responsible for risk assessment throughout the software development lifecycle. The CrowdStrike outage emphasizes the importance of factoring potential security patch failures into the risk matrix. By proactively identifying these risks, teams can implement mitigation strategies like rigorous testing in non-production environments.
Championing Robust Testing: Agile methodologies promote iterative development with frequent testing cycles. This incident reinforces the need for thorough testing, especially for security patches, before pushing them to production. Project managers can advocate for sufficient time and resources to be allocated for testing, ensuring quality and minimizing the risk of widespread outages.
Effective Communication: Clear and transparent communication is a hallmark of successful agile projects. The CrowdStrike outage underscores the importance of seamless communication between development, testing, and security teams. Project managers can facilitate open communication channels, fostering collaboration and ensuring everyone is aware of potential issues before deployment.
Importance of Non-Production Environments: The incident highlights the dangers of deploying updates directly to production environments. Project managers can champion the use of separate, controlled environments (dev, test, stage) for rigorous testing before pushing updates live. This allows for the identification and resolution of bugs before they impact critical systems.
Embracing Post-Launch Testing: Testing shouldn't end once a release goes live. Project managers can work with agile teams to establish processes for post-launch monitoring and testing. This can help identify any unforeseen issues and enable rapid response if necessary.

By understanding these lessons from the CrowdStrike outage, project managers and agile professionals can become valuable assets in their organizations. They can champion secure practices, facilitate clear communication, and ensure a smooth and efficient patching process, minimizing the risk of outages and fostering trust with stakeholders.

Remember, by being proactive and organized, with checklists and clear communication channels, you can become a superstar for your team, preventing issues and ensuring successful deployments. Again, this is why Atlassian products are always my partners in crime for documentation. I can't tell you how many times using Jira/Confluence has prevented mishaps.

#atlassian #atlassiancreator #crowdstrike #projectmanagement #testing #QAT #releasemanagement #agile #scrum #jira #confluence
