Don't Blame CrowdStrike for a Lack of Preparedness

The recent global IT outage caused by a software patch from CrowdStrike is a reminder of why resiliency planning for critical infrastructure is absolutely essential. This incident, which disrupted countless systems and industries, highlights the fragility of our interconnected systems and the urgent need to address the current state of IT resiliency.

It is a hard but undeniable truth: while it is easy to blame a vendor in cases like this, the ultimate responsibility for maintaining a resilient infrastructure falls on the IT organization that supports it. This responsibility becomes even more critical when the IT infrastructure underpins essential business operations or societal functions such as communications or healthcare.

IT leaders responsible for critical and essential societal services must ensure their infrastructure is designed with redundancy and resiliency at its core.

Understanding the Outage

The outage was triggered by a software update that affected Microsoft Windows-based systems worldwide, causing significant disruptions in healthcare, aviation, emergency services, and more (Fitch, Schechner, & Needleman, 2024). Within minutes, corporate laptops turned into unusable bricks, halting operations at restaurants, media companies, and even 911 call centers. The disruption spread rapidly, affecting tens of thousands of flights and leaving professionals stranded with blue screens on their laptops (Fitch et al., 2024). This incident underscores the vulnerabilities in our interconnected IT systems and the critical need for robust infrastructure resilience strategies.

Embracing the Philosophy of "Everything Fails All the Time"

Werner Vogels, CTO of Amazon Web Services, famously stated, "Everything fails all the time" (Vogels, 2020). This principle should be fundamental in designing resilient IT infrastructures. By anticipating failures and building systems that can handle and recover from them, organizations can ensure continuous availability and reliability of their critical services.

Strategies to Enhance IT Resiliency

Reflecting on past experiences, I recall a time when we categorized certain servers and infrastructure components as "business critical." These components were deemed so essential that any outage would result in significant disruption. In today's IT landscape, however, this approach is as outdated as it is risky.

Show me a business-critical server or IT component, and I will show you a single point of failure.

Here are some key strategies to avoid such pitfalls and enhance resiliency:

Blue/Green Deployment Patterns:

This involves running two separate but identical production environments: one with the current version (blue) and one with the new version (green). Apply all changes (patches, for example) to the green environment and test them there. Once production traffic is directed at the green environment, it can easily be redirected back to the blue environment if a failure occurs (Humble & Farley, 2010). A number of AWS services support this pattern, including AWS Auto Scaling, AWS Elastic Beanstalk, AWS OpsWorks, AWS CloudFormation, Amazon ECS, and Amazon EKS.
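As a minimal sketch (assuming an Application Load Balancer fronts both environments; the ARNs below are hypothetical placeholders), the cutover and the rollback can each be a single listener update with boto3:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARNs for the listener and the two environments' target groups.
LISTENER_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/example/abc/def"
BLUE_TG = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/blue/111"
GREEN_TG = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/green/222"

def shift_traffic(target_group_arn: str) -> None:
    """Point all listener traffic at the given target group."""
    elbv2.modify_listener(
        ListenerArn=LISTENER_ARN,
        DefaultActions=[{"Type": "forward", "TargetGroupArn": target_group_arn}],
    )

# Patch and verify the green environment first, then cut over:
shift_traffic(GREEN_TG)

# If canaries or health checks report failures, rollback is one call:
# shift_traffic(BLUE_TG)
```

Because the blue environment keeps running at the known-good version, rolling back is a single API call rather than a rebuild under pressure.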

Diverse Operating Systems and Platforms:

Avoid relying on a single operating system or platform. Diversifying the IT environment across multiple operating systems and platforms reduces the risk of a widespread outage and enhances overall system resilience (Cappos et al., 2016).
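As a hypothetical illustration (not drawn from any cited source), a fleet can be audited for platform concentration by tallying running EC2 instances on the PlatformDetails field that describe_instances returns:

```python
import boto3
from collections import Counter

ec2 = boto3.client("ec2")

def platform_counts() -> Counter:
    """Tally running instances by OS platform (e.g., 'Windows', 'Linux/UNIX')."""
    counts: Counter = Counter()
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                counts[instance.get("PlatformDetails", "unknown")] += 1
    return counts

counts = platform_counts()
total = sum(counts.values()) or 1
for platform, n in counts.most_common():
    share = n / total
    flag = "  <-- concentration risk" if share > 0.8 else ""
    print(f"{platform}: {n} instances ({share:.0%}){flag}")
```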

Auto-Scaling and Self-Healing Mechanisms:

Use technologies like Auto Scaling groups in AWS to improve recovery mechanisms. Auto Scaling adjusts capacity based on demand, ensuring availability and performance. Implement self-healing mechanisms that automatically detect and recover from failures, further enhancing resiliency (Amazon Web Services, 2020). Additionally, updating the golden image from which instances in the Auto Scaling group are launched provides a mechanism for rapidly deploying remediations across the infrastructure.
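A minimal sketch of that golden-image workflow with boto3, assuming the group launches from a launch template (all IDs below are placeholders): create a template version pointing at the remediated AMI, make it the default, then start an instance refresh so instances are replaced gradually.

```python
import boto3

ec2 = boto3.client("ec2")
autoscaling = boto3.client("autoscaling")

def roll_out_patched_ami(template_id: str, asg_name: str, new_ami_id: str) -> str:
    """Point the launch template at a remediated AMI, then refresh the group."""
    # New template version sourced from the latest, with only the AMI swapped.
    new_version = ec2.create_launch_template_version(
        LaunchTemplateId=template_id,
        SourceVersion="$Latest",
        LaunchTemplateData={"ImageId": new_ami_id},
    )
    version_number = str(new_version["LaunchTemplateVersion"]["VersionNumber"])
    ec2.modify_launch_template(
        LaunchTemplateId=template_id, DefaultVersion=version_number
    )

    # Replace instances gradually, keeping 90% of capacity healthy throughout.
    response = autoscaling.start_instance_refresh(
        AutoScalingGroupName=asg_name,
        Preferences={"MinHealthyPercentage": 90},
    )
    return response["InstanceRefreshId"]

# Hypothetical identifiers:
# roll_out_patched_ami("lt-0123456789abcdef0", "web-asg", "ami-0123456789abcdef0")
```

MinHealthyPercentage keeps most capacity in service during the rollout, and a problematic refresh can be stopped with cancel_instance_refresh before it spreads.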

Managed Services:

In AWS, customers can mitigate OS-specific vulnerabilities by using a variety of managed services. For example, those using Amazon Relational Database Service (RDS) for Microsoft SQL Server were not affected during the recent outage. Leveraging managed services offloads the need for customers to manage and maintain the underlying operating systems, thereby enhancing system resilience.
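For illustration only, provisioning SQL Server as a managed RDS instance is a single API call; the identifier and engine version below are placeholders, and options such as MultiAZ and AutoMinorVersionUpgrade shift failover and patching work to AWS:

```python
import boto3

rds = boto3.client("rds")

rds.create_db_instance(
    DBInstanceIdentifier="example-sqlserver",  # hypothetical name
    Engine="sqlserver-se",                     # SQL Server Standard Edition
    EngineVersion="15.00.4345.5.v1",           # hypothetical; pick a supported version
    DBInstanceClass="db.m5.large",
    AllocatedStorage=100,                      # GiB
    MasterUsername="admin",
    ManageMasterUserPassword=True,             # credential stored in Secrets Manager
    MultiAZ=True,                              # synchronous standby for automatic failover
    AutoMinorVersionUpgrade=True,              # AWS applies minor engine patches
    LicenseModel="license-included",
)
```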

Proactive Monitoring and Incident Response:

Implement continuous monitoring of IT systems and proactive incident response plans. Advanced monitoring tools and robust incident response teams help detect and mitigate issues early, minimizing their impact (Ali et al., 2021). AWS offers many services in this area, such as Amazon CloudWatch Synthetics (canaries), Amazon Route 53 health checks, and Amazon Route 53 Application Recovery Controller.
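As a hedged sketch, the example below pairs a Route 53 health check against a hypothetical endpoint with a CloudWatch alarm on its status; note that Route 53 publishes health-check metrics to CloudWatch in us-east-1:

```python
import uuid
import boto3

route53 = boto3.client("route53")
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

check = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),  # idempotency token
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "api.example.com",  # hypothetical endpoint
        "ResourcePath": "/healthz",                     # hypothetical path
        "RequestInterval": 30,  # seconds between probes from each checker
        "FailureThreshold": 3,  # consecutive failures before "unhealthy"
    },
)

# Alarm when the health check reports unhealthy (status drops below 1).
cloudwatch.put_metric_alarm(
    AlarmName="api-example-com-unhealthy",
    Namespace="AWS/Route53",
    MetricName="HealthCheckStatus",
    Dimensions=[{"Name": "HealthCheckId", "Value": check["HealthCheck"]["Id"]}],
    Statistic="Minimum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",
)
```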

Shared Responsibility for Resiliency

While CrowdStrike's software update triggered the recent outage, responsibility must also be shared by IT organizations that failed to incorporate the necessary resiliency mechanisms. Relying solely on a single vendor or system without robust failover strategies exposes organizations to significant risks. IT leaders must ensure their infrastructure is designed with redundancy and resiliency at its core.

Conclusion

The CrowdStrike incident is yet another reminder of the vulnerabilities present in our interconnected IT systems. It is imperative for organizations to prioritize infrastructure resiliency to safeguard their operations and critical services. By adopting strategies such as those described here, organizations can build more robust and resilient IT infrastructures capable of withstanding future disruptions.

If your environment was impacted by the recent outage or if you just want to learn more about enhancing your infrastructure's resiliency, please contact your AWS representative to explore options that are ideal for your unique environment.

References



Comments

Clinton Black

Integration Engineer 3 - Unisys

7 months ago

Well said!

Paul Stalder

Technical Consultant, Data Center & Cloud Transformation at NTT DATA Services

8 months ago

Sometimes it takes a major event like this to identify non-resilient applications and redirect our train of thought. I am still considering how some of our resiliency tools could have been used to more effectively restore service. Understanding the impact on data (or data loss) for each application is still a gap that is difficult to understand and that is a critical decision point. Especially for older legacy applications. Scaling teams quickly is another area that was identified by this issue (think bitlocker keys or hardware console remote access).

Mehdi Daoudi

Driving Internet Innovation | Expert in Monitoring, Reliability, Web Perf & Resiliency | Champion of Customer & Workforce Experiences | Collaborating with Internet Builders & Operators to Deliver Exceptional Results

8 months ago

Well said Paul! So true!

Chris S.

Principal Practice Manager Amazon Web Services (AWS) | Builder of Cloud Ninjas | 6x AWS Certified | Former Principal | Mentor | Career Champion

8 months ago

Well said.
