Lessons Beyond the CrowdStrike Affair
David Neuman
CISO | Retired Senior Military Leader | Board Advisor | Adjunct Faculty | Executive Coach
"The only real mistake is the one from which we learn nothing." - Henry Ford
In 1996, I was on a business trip to San Antonio, Texas. Walking along the River Walk, my colleague and I found a restaurant that looked nice but was empty. Hungry, we took a chance and asked for a table for two. The host said they weren’t seating anyone because their computers were down. I offered that I wasn’t planning to eat a computer. She didn’t appreciate my humor, so I suggested we could enjoy a drink at the bar while they resolved the issue. “I’m sorry,” the host said; they couldn’t serve drinks either because their computers were down. That was 28 years ago, and I could not help but wonder why any business would place such reliance on technology. I could not have imagined how significantly that reliance would grow.
On July 19, 2024, a faulty software update from CrowdStrike, a prominent cybersecurity firm, triggered a massive global technology outage. The update, intended for the Falcon Sensor security software on Microsoft Windows systems, caused approximately 8.5 million devices to crash, leading to widespread disruptions. The issue stemmed from an out-of-bounds memory read in the Windows sensor client, resulting in invalid page faults and causing systems to enter a boot loop or boot into recovery mode. This incident, described as the largest outage in IT history, affected various sectors, including airlines, banks, hospitals, and government services.
The impacts were far-reaching and severe. Airlines experienced significant delays and cancellations as check-in and booking systems went offline, causing long lines at airports worldwide. Banks in several countries faced outages that disrupted payment systems, while hospitals struggled with appointment scheduling, leading to delays in critical care. Media outlets, particularly in Australia, could not broadcast for hours, and some emergency services in the U.S. reported issues with 911 systems. The financial damage from the outage is estimated to be at least $10 billion, highlighting the vulnerability of global dependence on a few key technology providers.
In the months following the incident, there has been plenty of finger-pointing and calls for change, primarily technical and near-term. We will leave that commentary to others. We believe CrowdStrike is a good company, and consideration should be given to the millions of attacks it has helped thwart over the years. This is not mere opinion. As customers who have watched the company from its startup days through to seeing it in action today, we have no doubt about how seriously it has taken all of this, and it has already mobilized to ensure it doesn’t happen again. However, while CrowdStrike was the impetus for the affair, we should never waste an opportunity to learn from an incident; deep examination and even self-reflection are necessary to improve.
What Lessons Should We Learn?
When disaster strikes, most will look for, or stop at, the immediate cause. The CrowdStrike affair is an opportunity to analyze not just what happened but, in context, why conditions existed that so severely impacted such a broad range of industries. To do that, we asked deeper questions, such as why there is a need for a multi-billion-dollar cybersecurity industry. According to Fortune Business Insights, the global cybersecurity market was valued at $172.24 billion in 2023 and is projected to grow from $193.73 billion in 2024 to $562.72 billion by 2032. Is it because threat actors are so sophisticated, or because our technology capabilities and services are poorly designed and then implemented in areas of high reliance, exposing organizations to unacceptable risk? Is this considered a cyber incident? Is it reportable to a regulator? Was the impact material to the affected company? These are some of the many questions we used to identify deeper opportunities for lessons learned.
Over-Reliance on Technology
Businesses, governments, and societies have become reliant on technologies and the outcomes they promise. Our increasing dependence on technology, without the process, overhead, and staff needed to manage that dependence, makes us vulnerable to disruptions when these systems fail. This reliance extends beyond convenience and into critical operations, where the bias to streamline, cut costs, and drive profitability can lead to failures with severe consequences. The CEO of Delta Air Lines, Ed Bastian, stated there “was nothing they could do” after the outage disrupted Delta for longer than other airlines. Delta now puts the material impact at $500 million. If Delta was so reliant on the affected technology systems, it would be wise to look at how it designs and protects those technologies so there would be something it could do when such a situation happens. That raises the question of building resiliency into an operation.
Resilience Over Speed
This affair underscores a critical need to shift from prioritizing speed to emphasizing resilience in technology development. For companies in hyper-competitive markets, especially technology companies, the Achilles heel is their very success. It is a common tradeoff to rush products to market to capture market share, meet investor expectations, and satisfy consumer demand. This emphasis on speed frequently compromises the robustness and reliability of solutions, leading to system failures, security risks, reputational damage, and significant financial losses. Comprehensive testing, resilient design principles, and balanced development timelines are essential for ensuring product and service reliability, and they depend on a company culture that values quality over speed.
Beyond technical considerations, non-technical aspects such as ethical responsibility, stakeholder engagement, and sustainable practices play vital roles in fostering resilience. Companies are ethically obligated to prioritize user safety and security, requiring transparent communication with stakeholders about the importance of resilience and adherence to standards. Sustainable business practices that invest in employee training, research and development, and resilient infrastructure support long-term success over short-term gains. By embracing these principles, companies can mitigate risks, build trust, and create reliable, secure, and sustainable products that benefit the company and its customers in the long run.
Single Points of Failure
Architectures and the processes they support must not include single points of failure. Patient scheduling systems, 911 systems, and airline kiosks are examples: when these critical systems go down, they can cripple entire sectors. We must design systems with redundancy and fail-safes to ensure continuity of operations.
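The call for redundancy and fail-safes can be illustrated with a minimal sketch. It assumes a critical operation, here a hypothetical airline check-in, has more than one way to complete, including a degraded manual path. The function and class names are invented for illustration and do not describe any real system.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when the primary and every fallback fail."""


def call_with_failover(providers: Sequence[Callable[[], str]]) -> str:
    """Try each provider in order and return the first successful result."""
    errors = []
    for provider in providers:
        try:
            return provider()
        except Exception as exc:  # deliberately broad for this sketch
            errors.append(exc)
    raise AllProvidersFailed(f"{len(errors)} provider(s) failed: {errors}")


def primary_checkin() -> str:
    # Hypothetical primary system, simulated here as being down.
    raise ConnectionError("primary check-in system unavailable")


def manual_checkin() -> str:
    # Hypothetical degraded-mode fallback, such as an offline or paper process.
    return "checked in via manual fallback"


result = call_with_failover([primary_checkin, manual_checkin])
print(result)
```

The point of the sketch is the ordering: the fallback exists before the outage and is exercised automatically, so there is "something they could do" when the primary path fails.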
Erosion of Critical Thinking
Hyper-growth in technology has produced a generation of tech tool administrators rather than critical thinkers and problem solvers. Problem solvers must understand how to build ecosystems that drive safety, wellness, and prosperity. The emphasis on technical skills must be balanced with fostering creativity and problem-solving abilities.
Operational Discipline
Operational discipline in product development and sustainment has given way to the procrastination syndrome of a minimum viable product. Focusing on delivering just enough functionality to get by is a dangerous practice in critical systems. We must return to rigorous development and maintenance practices to ensure reliability and safety.
Guiding Principles for Operational Excellence
Not all technology problems require a technology solution; solutions often begin with a commitment to fundamental principles before technology is considered. These principles ensure that technology and security capabilities align with desired outcomes. Organizations can mitigate downstream costs, effort, and risks by integrating these principles into every phase: planning, design, testing, implementation, and sustainment.
Reliance ensures that systems and processes are dependable and trustworthy. This means building technology users can consistently rely on to perform as expected without frequent failures or unexpected downtime. It involves rigorous testing and quality assurance processes to identify and address potential issues before they affect users.
Resilience emphasizes the ability to recover and continue operations despite disruptions. It is about designing systems that can withstand and quickly recover from failures, whether due to cyberattacks, hardware malfunctions, or other unforeseen events. Resilience involves redundancy, failover mechanisms, and robust backup and recovery plans.
Scalability guarantees that solutions can grow and adapt to increased demands. As businesses expand and user bases grow, technology must scale efficiently without compromising performance. This involves designing systems with the flexibility to handle increased loads and integrating scalable infrastructure from the outset.
Superior customer experience focuses on delivering exceptional value and satisfaction to users. This principle highlights the importance of user-centric design, ensuring that technology is intuitive, accessible, and meets users' needs. It involves continuous user feedback and iterative improvements to enhance the overall experience.
Measurability allows for tracking and assessing performance and effectiveness. Organizations can monitor the success of their technology solutions by establishing clear measures and identifying areas for improvement. Measurability ensures that outcomes are quantifiable, providing a basis for data-driven decision-making.
Operational discipline involves maintaining high standards and consistency in development and maintenance. This principle emphasizes the importance of thorough documentation, adherence to best practices, and regular reviews to ensure that technology solutions are developed and maintained with precision and care.
Continuous improvement commits to ongoing enhancements and refinements. It acknowledges that technology is never static and that there is always room for improvement. Continuous improvement involves regularly updating and refining systems based on user feedback, technological advancements, and changing business needs.
These principles allow organizations to effectively address challenges and create robust, reliable, and user-centric solutions. They provide a foundation for building technology that meets current needs and is resilient and adaptable to future demands.
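As one hedged illustration of the measurability principle, availability over a reporting period is a measure many organizations could track against a stated target. The sketch below assumes downtime is already recorded; the 30-day period, 6-hour outage, and 99.9% target are made-up example values, not figures from the incident.

```python
from datetime import timedelta


def availability(period: timedelta, downtime: timedelta) -> float:
    """Return the fraction of the period during which the service was up."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    if downtime < timedelta(0) or downtime > period:
        raise ValueError("downtime must be between zero and the period")
    return 1 - downtime / period


def meets_target(period: timedelta, downtime: timedelta, target: float) -> bool:
    """Check measured availability against an agreed target such as 0.999."""
    return availability(period, downtime) >= target


# Hypothetical example values: a 30-day month with a single 6-hour outage.
month = timedelta(days=30)
outage = timedelta(hours=6)

print(f"availability: {availability(month, outage):.2%}")
print("meets 99.9% target:", meets_target(month, outage, 0.999))
```

With these example values a single 6-hour outage brings the month to roughly 99.17% availability, well short of a 99.9% target, which is the kind of quantified gap that supports a data-driven conversation about resilience investment.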
Final Thoughts
These are hard problems requiring hard conversations and actions. The challenges discussed are complex and multifaceted, and addressing them will demand significant effort and dedication. Many may criticize the observations in this article as overly simplistic or misaligned with their experiences. While these critiques may hold some truth, the purpose is not to present a one-size-fits-all solution. Each organization must navigate its unique path, tailored to its specific circumstances, culture, and goals. What remains clear, however, is that it is necessary to confront these issues head-on with a commitment to resilience, innovation, and continuous improvement.
About the Authors
Brandon Pinzon is a seasoned leader with over 17 years of experience across technology, banking, and insurance. Brandon is an experienced CSO and Risk Executive who currently lends his expertise to safeguarding companies through his advisory efforts. He oversees a comprehensive global security program encompassing cyber defense, data protection, identity management, physical security, data privacy, and business continuity/disaster recovery. Brandon's expertise is wide-ranging, from the boardroom to the classroom, spanning data collection, computer forensics, and crafting robust security and privacy strategies for heavily regulated industries. His ability to navigate complex data systems and collaborate with multinational corporations to establish best practices is well-recognized within the industry. This recognition is evident through his frequent speaking engagements and guest lectures while advising companies on how they can leave their mark on the industry. He plays a pivotal role in academia by actively advising on programming and curriculum, ensuring the next generation of professionals is well-equipped to navigate the dynamic landscape of cybersecurity.
David Neuman is a cybersecurity and business leader with over 39 years of experience in cyber operations, business resiliency, and transformation. As the founder of Road Rocks Services, LLC, he provides advisory solutions, Virtual CISO services, and executive coaching. David has a proven track record of driving organizational success through robust, decisive leadership and team-focused innovation. He previously led security strategy for a $60 billion global supply chain as VP & Business Security Officer at a Fortune 50 company and served as CISO at iHeartMedia and Rackspace Hosting. His career also includes leadership roles at EY and in the U.S. Air Force, where he commanded the first cyber hunting unit and led multinational teams in various capacities. David holds advanced degrees in National Security and Strategic Studies and in Security Administration, and is a Certified Information Systems Security Professional.
CEO at Aiden Technologies, Inc.
6 months ago
Love this, David Neuman - particularly your comment at the restaurant in 1996. David, this is a timely and important analysis on the evolving lessons from the CrowdStrike incident. It underscores the critical need for a proactive cybersecurity approach, particularly around configuration management and patch automation, which we focus on at Aiden Technologies, Inc. We've seen firsthand how AI and intelligent automation (IA) can reduce vulnerabilities and enhance security posture. The intersection of IT and cybersecurity remains crucial for organizations to achieve resilience. Excited to continue this dialogue!
The part about operational resilience hits home as someone who works in early-stage startups. MVPs, experimentation, A/B testing, ... these concepts are taught to find product/market fit (ensure you're solving a problem that matters) but in security, we must also be mindful of potential impact on production systems. It's much easier (less risky) to iterate on out-of-band analytics solutions with exported data than to build into host OSs.
Cyber Security Specialist
6 months ago
Love this collab David Neuman and Brandon Pinzon! These points stuck out to me:
* "Operational discipline in product development and sustainment has given way to the procrastination syndrome of a minimum viable product" To me, this screamed technical debt with no plans to ever be paid down. If the decision is made to pay it down, it costs so much more in terms of developer time, money, and innovation velocity.
* "Continuous improvement commits to ongoing enhancements and refinements" The bit rot I've seen in application code is real. For long periods of time, all application functionality ceases to work. Having the discipline to, for example, make small 3rd party library updates every few weeks vs. once a year can help.
Great stuff as always Mr. Neuman.