Don't Blame CrowdStrike for a Lack of Preparedness

The recent global IT outage caused by a software patch from CrowdStrike is a reminder of why resiliency planning for critical infrastructure is absolutely essential. This incident, which disrupted countless systems and industries, highlights the fragility of our interconnected systems and the urgent need to address the current state of IT resiliency.

It is a hard but undeniable truth: while it is easy to blame a vendor in cases like this, the ultimate responsibility for maintaining a resilient infrastructure falls on the IT organization that supports it. This responsibility becomes even more critical when the IT infrastructure underpins essential business operations or societal functions such as communications or healthcare.

IT leaders responsible for critical and essential societal services must ensure their infrastructure is designed with redundancy and resiliency at its core.

Understanding the Outage

The outage was triggered by a software update that affected Microsoft Windows-based systems worldwide, causing significant disruptions in healthcare, aviation, emergency services, and more (Fitch, Schechner, & Needleman, 2024). Within minutes, corporate laptops turned into unusable bricks, halting operations at restaurants, media companies, and even 911 call centers. The disruption spread rapidly, affecting tens of thousands of flights and leaving professionals stranded with blue screens on their laptops (Fitch et al., 2024). This incident underscores the vulnerabilities in our interconnected IT systems and the critical need for robust infrastructure resilience strategies.

Embracing the Philosophy of "Everything Fails All the Time"

Werner Vogels, CTO of Amazon Web Services, famously stated, "Everything fails all the time" (Vogels, 2020). This principle should be fundamental in designing resilient IT infrastructures. By anticipating failures and building systems that can handle and recover from them, organizations can ensure continuous availability and reliability of their critical services.

Strategies to Enhance IT Resiliency

Reflecting on past experiences, I recall a time when we categorized certain servers and infrastructure components as "business critical." These components were deemed so essential that any outage would result in significant disruption. In today's IT landscape, however, this approach is as outdated as it is risky.

Show me a business-critical server or IT component, and I will show you a single point of failure.

Here are some key strategies to avoid such pitfalls and enhance resiliency:

Blue/Green Deployment Patterns:

This involves running two separate but identical production environments: one with the current version (blue) and one with the new version (green). Apply all changes (patches, for example) to the green environment and test them there. Once production traffic is directed at the green environment, it can easily be redirected back to the blue environment if a failure occurs (Humble & Farley, 2010). A number of AWS services support this pattern, including AWS Auto Scaling, AWS Elastic Beanstalk, AWS OpsWorks, AWS CloudFormation, Amazon ECS, and Amazon EKS.
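As a minimal sketch (assuming an Application Load Balancer fronts both environments; the ARNs below are hypothetical placeholders), the cutover and the rollback can each be a single listener update with boto3:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARNs for the listener and the two environments' target groups.
LISTENER_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/example/abc/def"
BLUE_TG = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/blue/111"
GREEN_TG = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/green/222"

def shift_traffic(target_group_arn: str) -> None:
    """Point all listener traffic at the given target group."""
    elbv2.modify_listener(
        ListenerArn=LISTENER_ARN,
        DefaultActions=[{"Type": "forward", "TargetGroupArn": target_group_arn}],
    )

# Patch and verify the green environment first, then cut over:
shift_traffic(GREEN_TG)

# If canaries or health checks report failures, rollback is one call:
# shift_traffic(BLUE_TG)
```

Because the blue environment keeps running at the known-good version, rolling back is a single API call rather than a rebuild under pressure.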

Diverse Operating Systems and Platforms:

Avoid relying on a single operating system or platform. Diversifying the IT environment across multiple operating systems and platforms reduces the risk of a widespread outage and enhances overall system resilience (Cappos et al., 2016).
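As a hypothetical illustration (not drawn from any cited source), a fleet can be audited for platform concentration by tallying running EC2 instances on the PlatformDetails field that describe_instances returns:

```python
import boto3
from collections import Counter

ec2 = boto3.client("ec2")

def platform_counts() -> Counter:
    """Tally running instances by OS platform (e.g., 'Windows', 'Linux/UNIX')."""
    counts: Counter = Counter()
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                counts[instance.get("PlatformDetails", "unknown")] += 1
    return counts

counts = platform_counts()
total = sum(counts.values()) or 1
for platform, n in counts.most_common():
    share = n / total
    flag = "  <-- concentration risk" if share > 0.8 else ""
    print(f"{platform}: {n} instances ({share:.0%}){flag}")
```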

Auto-Scaling and Self-Healing Mechanisms:

Use technologies like Auto Scaling groups in AWS to improve recovery mechanisms. Auto Scaling adjusts capacity based on demand, ensuring availability and performance. Implement self-healing mechanisms that automatically detect and recover from failures, further enhancing resiliency (Amazon Web Services, 2020). Additionally, updating the golden image from which instances in the Auto Scaling group are launched provides a mechanism for rapidly deploying remediations across the infrastructure.
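A minimal sketch of that golden-image workflow with boto3, assuming the group launches from a launch template (all IDs below are placeholders): create a template version pointing at the remediated AMI, make it the default, then start an instance refresh so instances are replaced gradually.

```python
import boto3

ec2 = boto3.client("ec2")
autoscaling = boto3.client("autoscaling")

def roll_out_patched_ami(template_id: str, asg_name: str, new_ami_id: str) -> str:
    """Point the launch template at a remediated AMI, then refresh the group."""
    # New template version sourced from the latest, with only the AMI swapped.
    new_version = ec2.create_launch_template_version(
        LaunchTemplateId=template_id,
        SourceVersion="$Latest",
        LaunchTemplateData={"ImageId": new_ami_id},
    )
    version_number = str(new_version["LaunchTemplateVersion"]["VersionNumber"])
    ec2.modify_launch_template(
        LaunchTemplateId=template_id, DefaultVersion=version_number
    )

    # Replace instances gradually, keeping 90% of capacity healthy throughout.
    response = autoscaling.start_instance_refresh(
        AutoScalingGroupName=asg_name,
        Preferences={"MinHealthyPercentage": 90},
    )
    return response["InstanceRefreshId"]

# Hypothetical identifiers:
# roll_out_patched_ami("lt-0123456789abcdef0", "web-asg", "ami-0123456789abcdef0")
```

MinHealthyPercentage keeps most capacity in service during the rollout, and a problematic refresh can be stopped with cancel_instance_refresh before it spreads.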

Managed Services:

In AWS, customers can mitigate OS-specific vulnerabilities by using a variety of managed services. For example, those using Amazon Relational Database Service (RDS) for Microsoft SQL Server were not affected during the recent outage. Leveraging managed services offloads the need for customers to manage and maintain the underlying operating systems, thereby enhancing system resilience.
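For illustration only, provisioning SQL Server as a managed RDS instance is a single API call; the identifier and engine version below are placeholders, and options such as MultiAZ and AutoMinorVersionUpgrade shift failover and patching work to AWS:

```python
import boto3

rds = boto3.client("rds")

rds.create_db_instance(
    DBInstanceIdentifier="example-sqlserver",  # hypothetical name
    Engine="sqlserver-se",                     # SQL Server Standard Edition
    EngineVersion="15.00.4345.5.v1",           # hypothetical; pick a supported version
    DBInstanceClass="db.m5.large",
    AllocatedStorage=100,                      # GiB
    MasterUsername="admin",
    ManageMasterUserPassword=True,             # credential stored in Secrets Manager
    MultiAZ=True,                              # synchronous standby for automatic failover
    AutoMinorVersionUpgrade=True,              # AWS applies minor engine patches
    LicenseModel="license-included",
)
```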

Proactive Monitoring and Incident Response:

Implement continuous monitoring of IT systems and proactive incident response plans. Advanced monitoring tools and robust incident response teams help detect and mitigate issues early, minimizing their impact (Ali et al., 2021). AWS offers many services in this area, such as Amazon CloudWatch Synthetics (canaries), Amazon Route 53 health checks, and Amazon Route 53 Application Recovery Controller.
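As a hedged sketch, the example below pairs a Route 53 health check against a hypothetical endpoint with a CloudWatch alarm on its status; note that Route 53 publishes health-check metrics to CloudWatch in us-east-1:

```python
import uuid
import boto3

route53 = boto3.client("route53")
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

check = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),  # idempotency token
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "api.example.com",  # hypothetical endpoint
        "ResourcePath": "/healthz",                     # hypothetical path
        "RequestInterval": 30,  # seconds between probes from each checker
        "FailureThreshold": 3,  # consecutive failures before "unhealthy"
    },
)

# Alarm when the health check reports unhealthy (status drops below 1).
cloudwatch.put_metric_alarm(
    AlarmName="api-example-com-unhealthy",
    Namespace="AWS/Route53",
    MetricName="HealthCheckStatus",
    Dimensions=[{"Name": "HealthCheckId", "Value": check["HealthCheck"]["Id"]}],
    Statistic="Minimum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",
)
```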

Shared Responsibility for Resiliency

While CrowdStrike's software update triggered the recent outage, responsibility must also be shared by IT organizations that failed to incorporate the necessary resiliency mechanisms. Relying solely on a single vendor or system without robust failover strategies exposes organizations to significant risks. IT leaders must ensure their infrastructure is designed with redundancy and resiliency at its core.

Conclusion

The CrowdStrike incident is yet another reminder of the vulnerabilities present in our interconnected IT systems. It is imperative for organizations to prioritize infrastructure resiliency to safeguard their operations and critical services. By adopting strategies such as those described here, organizations can build more robust and resilient IT infrastructures capable of withstanding future disruptions.

If your environment was impacted by the recent outage or if you just want to learn more about enhancing your infrastructure's resiliency, please contact your AWS representative to explore options that are ideal for your unique environment.

References



Comments

Clinton Black

Integration Engineer 3 - Unisys

7 months ago

Well said!

Paul Stalder

Technical Consultant, Data Center & Cloud Transformation at NTT DATA Services

8 months ago

Sometimes it takes a major event like this to identify non-resilient applications and redirect our train of thought. I am still considering how some of our resiliency tools could have been used to more effectively restore service. Understanding the impact on data (or data loss) for each application is still a gap that is difficult to understand and that is a critical decision point. Especially for older legacy applications. Scaling teams quickly is another area that was identified by this issue (think bitlocker keys or hardware console remote access).

Mehdi Daoudi

Driving Internet Innovation | Expert in Monitoring, Reliability, Web Perf & Resiliency | Champion of Customer & Workforce Experiences | Collaborating with Internet Builders & Operators to Deliver Exceptional Results

8 months ago

Well said Paul! So true!

Chris S.

Principal Practice Manager Amazon Web Services (AWS) | Builder of Cloud Ninjas | 6x AWS Certified | Former Principal | Mentor | Career Champion

8 months ago

Well said.
