CrowdStrike Outage: A Lesson in Security Patching Gone Wrong

The recent widespread IT outage caused by a faulty CrowdStrike update serves as a powerful reminder of the importance of well-defined security patching processes. While keeping systems updated is essential for cybersecurity, this incident underscores the critical need for robust testing, clear communication protocols, and best practices drawn from project management and agile methodologies. Let's delve deeper into the key takeaways and how they translate into more secure and efficient patching strategies.

This event also sparked a teachable moment, as many parents can attest. When children encounter complex events in the news, curiosity often follows. The following story, inspired by conversations with Rosie and Vicky, aims to explain the situation in a way that resonates with a younger audience as well as non-technical folks. Without further ado, The Very Agile PM Presents....

The Adventures of Falcon & Billy
The Adventures of Falcon and Billy - Page 2

Now back to our regularly scheduled programming... What happened last week?!

The Culprit: A Faulty Falcon Update

The outage stemmed from a routine content update for CrowdStrike's Falcon endpoint detection and response (EDR) software. Unfortunately, the update contained a defect that triggered a blue screen of death (BSOD) on affected Microsoft Windows devices. This critical error rendered countless machines inoperable, causing significant disruptions across various industries.

Impact Beyond Security

The fallout from the outage extended far beyond cybersecurity concerns. Airports around the globe grounded flights, financial transactions stalled, and even emergency services were hampered. This cascading effect underscores the critical role cybersecurity plays in modern infrastructure.

Lessons Learned

Several key takeaways can be gleaned from this incident:

Importance of Thorough Testing: Rigorous testing procedures are essential before deploying any security updates, especially those with system-wide implications.

Never Test in Production and Label Your Environments Clearly

Production environments, often denoted as prod or production in system labels, should be reserved for stable, reliable, and mission-critical systems. Testing updates in separate, controlled environments allows for the identification and resolution of issues before they impact critical systems.

These separate environments should use clear labels for easy identification, such as:

Development (dev): This environment is used for initial development and coding of the update.

Testing (test): This environment replicates a production-like setting for rigorous testing of the update before deployment.

Staging (stage): This environment serves as a final testing ground before pushing the update to production. It closely mirrors the production environment, allowing for a final check for compatibility and functionality.

Deploying an update directly to a production environment for testing significantly increases the risk of disrupting core functionalities and causing widespread outages.
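
To make the idea concrete, here is a minimal sketch of an environment-promotion gate in Python. The environment labels, the promote function, and the dev -> test -> stage -> prod order are illustrative assumptions, not any particular vendor's pipeline.

```python
# Minimal sketch of an environment-promotion gate (hypothetical names and order).
# An update may only move one step at a time: dev -> test -> stage -> prod.

PIPELINE = ["dev", "test", "stage", "prod"]

def promote(update_id: str, current_env: str, target_env: str, tests_passed: bool) -> str:
    """Allow promotion only to the next environment, and only if tests passed."""
    if current_env not in PIPELINE or target_env not in PIPELINE:
        raise ValueError(f"Unknown environment label: {current_env!r} or {target_env!r}")
    if PIPELINE.index(target_env) != PIPELINE.index(current_env) + 1:
        raise ValueError(f"{update_id}: cannot skip from {current_env} to {target_env}")
    if target_env == "prod" and not tests_passed:
        raise ValueError(f"{update_id}: blocked -- stage tests have not passed")
    return f"{update_id} promoted from {current_env} to {target_env}"

# This would raise, because the update tries to jump straight from dev to prod:
# promote("sensor-update-7.11", "dev", "prod", tests_passed=False)
print(promote("sensor-update-7.11", "stage", "prod", tests_passed=True))
```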

Strategic Release Scheduling: Scheduling releases for times with lower operational impact is crucial. Fridays are often avoided for several reasons (a simple scheduling check is sketched after this list):

Reduced Staffing: Many IT teams have smaller support staff on weekends. If an issue arises with a Friday release, it can lead to delayed resolutions due to limited personnel available to fix the problem.

Limited Development Resources: Developers may not be readily available to address critical bugs or unexpected issues that emerge after a Friday deployment. This can leave organizations scrambling to find solutions over the weekend.

Change Aversion on Fridays: People are generally less receptive to major changes on the cusp of the weekend. A Friday release can disrupt workflows or cause confusion just as employees are mentally winding down for the weekend or checking out early.

Testing Window Concerns: Ideally, new releases undergo thorough testing before deployment. A Friday release limits the amount of time for this testing to occur during the workweek, potentially increasing the risk of bugs slipping through the cracks.
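
To make that policy enforceable rather than tribal knowledge, a release pipeline can simply refuse out-of-window deploys. A minimal sketch, assuming a hypothetical "no Fridays or weekends, nothing after 15:00 local time" policy:

```python
from datetime import datetime

# Hypothetical release-window policy: no Fridays or weekends, and nothing late
# in the day, so the team is still around if something breaks.
BLOCKED_WEEKDAYS = {4, 5, 6}   # Friday=4, Saturday=5, Sunday=6 per datetime.weekday()
LATEST_DEPLOY_HOUR = 15

def release_window_ok(proposed: datetime) -> bool:
    """Return True if the proposed release time falls inside the allowed window."""
    if proposed.weekday() in BLOCKED_WEEKDAYS:
        return False
    return proposed.hour < LATEST_DEPLOY_HOUR

print(release_window_ok(datetime(2024, 7, 19, 4, 0)))   # a Friday -> False
print(release_window_ok(datetime(2024, 7, 16, 10, 0)))  # a Tuesday morning -> True
```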

Clear Testing Plans and Communication with Confluence and Jira:

Well-defined testing plans documented in tools like Jira/Confluence are essential. However, to enhance transparency and collaboration, consider this approach:

Confluence for Detailed Plans and Collaboration: Utilize Confluence to document well-defined testing plans. These plans should detail the scope of testing, including specific functionalities and potential risk areas. Additionally, Confluence allows for creating clear test cases outlining the steps to be followed and the expected outcomes. This fosters transparency and collaboration among development, testing, and security teams. Team members can easily access the plans, add comments, and discuss potential issues before deployment.

Jira for Tracking and Execution: Jira is a powerful tool for managing the testing process itself. Test cases created in Confluence can be linked to Jira issues, allowing for efficient tracking and execution. Testers can use Jira to record their progress, log any bugs encountered, and track the overall status of the testing cycle.

This integration provides a clear view of the testing progress and facilitates communication between teams. By combining the detailed planning capabilities of Confluence with the tracking and execution strengths of Jira, organizations can achieve clear communication and ensure a smooth and efficient testing process. This approach minimizes the risk of unforeseen issues like the one experienced with the Crowdstrike update.
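
As a rough illustration of that Confluence/Jira pairing, the sketch below creates a Jira tracking issue that links back to a Confluence test plan via the Jira Cloud REST API (POST /rest/api/2/issue). The site URL, project key, credentials, and Confluence page URL are placeholders, so treat this as an outline rather than a drop-in script.

```python
import requests
from requests.auth import HTTPBasicAuth

# Placeholder values -- swap in your own site, project key, credentials, and page link.
JIRA_SITE = "https://your-domain.atlassian.net"
AUTH = HTTPBasicAuth("you@example.com", "<api-token>")
TEST_PLAN_URL = "https://your-domain.atlassian.net/wiki/spaces/QA/pages/123456/Update-Test-Plan"

payload = {
    "fields": {
        "project": {"key": "REL"},
        "issuetype": {"name": "Task"},
        "summary": "Execute test plan for sensor update 7.11 in stage",
        "description": f"Test plan (Confluence): {TEST_PLAN_URL}\nScope: regression, rollback drill.",
    }
}

# Create the tracking issue; testers then log progress and bugs against it.
resp = requests.post(
    f"{JIRA_SITE}/rest/api/2/issue",
    json=payload,
    auth=AUTH,
    headers={"Accept": "application/json"},
)
resp.raise_for_status()
print("Created tracking issue:", resp.json()["key"])
```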

Rollback Plan: A well-defined rollback plan ensures a swift recovery in case of unforeseen issues. This plan should outline the steps to revert to the previous version of the software and minimize downtime.
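
A rollback plan is far easier to execute when the last known-good version is recorded before the new one ships. Below is a minimal sketch of that idea; the deploy_version function and the health-check flag are stand-ins for whatever your deployment tooling and monitoring actually provide.

```python
# Sketch of a rollback-aware deploy wrapper (deploy_version is a hypothetical stand-in).

deployed_history = ["7.09", "7.10"]          # known-good versions, oldest first

def deploy_version(version: str) -> None:
    # Stand-in for real deployment tooling (config push, package install, etc.).
    print(f"Deploying version {version} ...")

def deploy_with_rollback(new_version: str, healthy: bool) -> str:
    """Deploy new_version; if the post-deploy health check fails, revert to the last good one."""
    deploy_version(new_version)
    if healthy:
        deployed_history.append(new_version)
        return f"{new_version} is live"
    last_good = deployed_history[-1]
    deploy_version(last_good)                # revert immediately to minimize downtime
    return f"{new_version} failed health check; rolled back to {last_good}"

print(deploy_with_rollback("7.11", healthy=False))
```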

Communication is Key: Clear and timely communication during outages minimizes confusion and allows organizations to take necessary mitigation steps. As a PM or Agile professional, when I know my team is going to do anything that will have an impact, I let those being impacted know early on. Even a simple heads-up to your customer support teams via email or other channels helps: we will be doing a release on xx/xx/xxxx at 00:00 EST, it will impact x, y, and z, here is what the release is and why we are doing it, and if you see anything unusual, report it to XYZ. The same goes for any scheduled outages or slowdowns. I can't tell you how many times I have seen money wasted chasing ghosts due to a lack of communication during testing and releases.
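
In practice, that heads-up can be as simple as a templated note to support and other impacted teams. A small sketch, with placeholder details mirroring the xx/xx/xxxx example above:

```python
def release_announcement(release: str, when: str, impact: str, contact: str) -> str:
    """Build a plain-text heads-up for support teams and other impacted stakeholders."""
    return (
        f"Subject: Upcoming release: {release}\n\n"
        f"What: {release}\n"
        f"When: {when}\n"
        f"Expected impact: {impact}\n"
        f"Why: routine security patching to keep endpoints protected.\n"
        f"If you see anything unusual during or after the window, report it to {contact}.\n"
    )

# Placeholder details -- fill these in per release.
print(release_announcement(
    release="Endpoint sensor update 7.11",
    when="xx/xx/xxxx at 00:00 EST",
    impact="brief agent restarts on Windows endpoints (x, y, and z)",
    contact="the release channel / on-call PM",
))
```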

The CrowdStrike outage exposed an unexpected wrinkle: certain companies, like UPS, FedEx, and Southwest Airlines, reportedly remained unaffected due to their reliance on a much older operating system, Windows 3.1 (released in 1992). While this news is initially concerning from a cybersecurity standpoint – a 32-year-old OS lacks the support and security patches of modern systems – it also sparks a fascinating question. Could the Y2K scare, which prompted a global focus on software compatibility and updates, have inadvertently had a silver lining? Perhaps the rigorous testing conducted back then unexpectedly protected these legacy systems from a vulnerability in the CrowdStrike update. This incident highlights the complex interplay between security updates, outdated systems, and the potential for unforeseen consequences.

When I thought about it, I envisioned and created this meme:

One thing is clear, as I always said when I was responsible for the Product Security Champion program:

Security should never be an afterthought. IT and security should be well funded in organizations, because skimping on them will cost companies 10x more in brand reputation, lost sales, and remediating the situation.

Why the CrowdStrike Outage Matters to Project Managers and Agile Teams

The recent CrowdStrike outage serves as a cautionary tale for project managers and agile professionals, highlighting the critical role they play in ensuring secure and smooth deployments. Here's why understanding this incident is crucial:

  • Identifying Risks Early: Project managers and agile teams are responsible for risk assessment throughout the software development lifecycle. The CrowdStrike outage emphasizes the importance of factoring potential security patch vulnerabilities into the risk matrix. By proactively identifying these risks, teams can implement mitigation strategies like rigorous testing in non-production environments.
  • Championing Robust Testing: Agile methodologies promote iterative development with frequent testing cycles. This incident reinforces the need for thorough testing, especially for security patches, before pushing them to production. Project managers can advocate for sufficient time and resources to be allocated for testing, ensuring quality and minimizing the risk of widespread outages.
  • Effective Communication: Clear and transparent communication is a hallmark of successful agile projects. The CrowdStrike outage underscores the importance of seamless communication between development, testing, and security teams. Project managers can facilitate open communication channels, fostering collaboration and ensuring everyone is aware of potential issues before deployment.
  • Importance of Non-Production Environments: The incident highlights the dangers of deploying updates directly to production environments. Project managers can champion the use of separate, controlled environments (dev, test, stage) for rigorous testing before pushing updates live. This allows for the identification and resolution of bugs before they impact critical systems.
  • Embracing Post-Launch Testing: Testing shouldn't end once a release goes live. Project managers can work with agile teams to establish processes for post-launch monitoring and testing, such as the smoke check sketched below. This can help identify any unforeseen issues and enable rapid response if necessary.
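
One lightweight way to make post-launch testing routine is a scheduled smoke check against a few health endpoints right after the release window. The URLs and pass/fail rule below are placeholder assumptions; this is a sketch of the habit, not a monitoring product.

```python
import urllib.request

# Placeholder health endpoints -- swap in the services your release actually touches.
HEALTH_ENDPOINTS = [
    "https://example.com/health",
    "https://example.com/api/status",
]

def post_launch_smoke_check(urls: list[str], timeout: float = 5.0) -> list[str]:
    """Return the endpoints that failed to answer with HTTP 200 within the timeout."""
    failures = []
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status != 200:
                    failures.append(url)
        except Exception:
            failures.append(url)
    return failures

failed = post_launch_smoke_check(HEALTH_ENDPOINTS)
print("All good" if not failed else f"Investigate: {failed}")
```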

By understanding these lessons from the CrowdStrike outage, project managers and agile professionals can become valuable assets in their organizations. They can champion secure practices, facilitate clear communication, and ensure a smooth and efficient patching process, minimizing the risk of outages and fostering trust with stakeholders.

Remember, by being proactive and organized, with checklists and clear communication channels, you can become a superstar for your team, preventing issues and ensuring successful deployments. Again, this is why Atlassian products are always my partner in crime for documentation. I can't tell you how many times using Jira/Confluence has prevented mishaps.

#atlassian #atlassiancreator #crowdstrike #projectmanagement #testing #QAT #releasemanagement #agile #scrum #jira #confluence

Valeriana Colón, Ph.D.

Learning Scientist | Future-proofing orgs with IT process innovation

2 months ago

Excellent points! Emphasizing separate environment testing and clear communication can prevent such incidents in the future.
