The CrowdStrike Incident: A Wake-Up Call

In the rapidly evolving landscape of technology, ensuring the reliability, availability, and security of critical systems is paramount. Even the most robustly secured systems can experience unexpected disruptions, as many IT teams and professionals became painfully aware last week.

On July 19th, 2024, at approximately 12:00 a.m. (EDT), CrowdStrike released an update (Channel File 291) for its Falcon sensor that triggered a logic error, causing affected Microsoft Windows systems to crash and blue screen. The resulting outages were felt worldwide, impacting healthcare, banking, emergency, and transportation services. Insurers have estimated $5 billion in damages, with 8.5 million systems impacted.

Understanding the Root Cause

From a bird’s-eye view, what sets CrowdStrike’s Falcon sensor apart from EDR solutions that run mostly in ‘user’ mode is that it operates in ‘kernel’ mode within the operating system. This allows the software to detect threats that run deeper within the OS. That makes it a far more capable defense, but also one far more capable of causing problems. The short version: if an application crashes in user mode, only the application crashes; if it crashes in kernel mode, the whole system crashes. This event underscores the importance of IT professionals understanding their environments holistically in order to adequately protect their assets.
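The user-mode half of that distinction can be illustrated with a minimal Python sketch. It starts a child process that deliberately aborts and then shows that the parent is still running. The helper name `run_isolated` is hypothetical, and this is only an analogy: it demonstrates process isolation in user mode, not kernel behavior or anything about how the Falcon sensor is built.

```python
import subprocess
import sys


def run_isolated(code: str) -> int:
    """Run a Python snippet in a separate user-mode process.

    Returns the child's exit code. A non-zero value means the child
    ended abnormally, while the caller keeps running.
    """
    result = subprocess.run([sys.executable, "-c", code], capture_output=True)
    return result.returncode


# The child process aborts itself, standing in for a crashing application.
exit_code = run_isolated("import os; os.abort()")

# Reaching this line shows the crash stayed inside the child process.
parent_still_running = True
print(f"Child exit code: {exit_code}; parent still running: {parent_still_running}")
```

Code running in kernel mode has no equivalent safety net, which is why a fault there takes the whole machine down rather than a single process.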

Lessons Learned: The Importance of Testing

The CrowdStrike event also exposed holes in the disaster recovery procedures of many companies, some of which are still struggling to recover. Two critical methods for assessing the readiness and resilience of IT systems are live testing and tabletop testing. Each approach has unique benefits and serves distinct purposes. Understanding these differences is essential for IT professionals seeking to fortify their systems against potential threats.

Live vs. Tabletop Testing: Assessing Your IT Resilience

Live testing involves simulating real-world scenarios to evaluate the performance, security, and resilience of IT systems. This method engages the actual infrastructure and applications, subjecting them to stress tests, penetration tests, and other live simulations.

Live simulations

Real-World Scenarios - Live testing provides a realistic assessment of how IT systems perform under actual conditions. This can uncover vulnerabilities and weaknesses that might be missed in a more controlled environment.

Comprehensive Evaluation - It allows for a thorough examination of system responses to various threats, including cyberattacks, hardware failures, and high traffic loads. This helps in identifying performance bottlenecks and security gaps.

Hands-On Experience - IT personnel gain practical experience in dealing with real incidents, which enhances their skills and confidence in managing actual emergencies.

Tabletop exercises

Tabletop testing involves key IT personnel discussing and simulating hypothetical scenarios in a controlled, discussion-based environment. This method focuses on decision-making processes, communication, and coordination without involving actual systems.

Advantages of Tabletop Testing

Focus on Strategy - These exercises emphasize strategic planning and communication, helping to identify gaps in incident response plans and decision-making processes.

Ease of Organization - Tabletop exercises are easier to organize and can be conducted more frequently, allowing for regular review and improvement of IT strategies.

Tabletop exercises, while useful for planning and communication, rely on assumptions about system responses and personnel actions that may not reflect the complexity of a real-world situation. They can overlook practical issues like hardware failures and unpredictable software behavior, as seen in the CrowdStrike incident, which can lead to a false sense of security. Live simulations are essential to uncover these real-time vulnerabilities and ensure comprehensive preparedness.
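One way to make that gap concrete is to record the time each recovery step was assumed to take during a tabletop exercise and compare it with the time measured in a live drill. The sketch below is a minimal illustration under stated assumptions: the `RecoveryStep` type, the `find_gaps` helper, the step names, the 1.5x tolerance, and every number are hypothetical and are not data from the incident.

```python
from dataclasses import dataclass


@dataclass
class RecoveryStep:
    name: str
    assumed_minutes: float   # estimate agreed on during the tabletop exercise
    measured_minutes: float  # time observed during a live drill


def find_gaps(steps, tolerance=1.5):
    """Return the steps whose measured time exceeded the tabletop
    assumption by more than `tolerance` times."""
    return [s for s in steps if s.measured_minutes > s.assumed_minutes * tolerance]


# Illustrative numbers only (hypothetical, not taken from any real drill).
steps = [
    RecoveryStep("Identify affected hosts", 15, 20),
    RecoveryStep("Boot host into safe mode", 5, 25),
    RecoveryStep("Retrieve disk encryption recovery key", 2, 40),
    RecoveryStep("Remove faulty file and reboot", 5, 6),
]

gaps = find_gaps(steps)
for step in gaps:
    print(f"{step.name}: assumed {step.assumed_minutes} min, "
          f"measured {step.measured_minutes} min")
```

Steps that show up in `gaps` are the ones where the tabletop assumptions were too optimistic, and they are good candidates for the next live exercise.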

Industry Impact and Lessons Learned

The CrowdStrike incident highlights the critical need for robust IT resilience across industries. As organizations become increasingly reliant on technology, the consequences of system failures can be far-reaching.

This incident underscores the importance of third-party software security and the need for rigorous testing protocols. Organizations should prefer software vendors that prioritize security and reliability in their development processes to minimize the risk of catastrophic failures.

Recommendations for IT Professionals

To enhance IT resilience and mitigate risks, IT professionals should:

  • Invest in comprehensive testing programs, including both live and tabletop exercises.
  • Develop robust incident response plans and conduct regular training for staff.
  • Stay informed about emerging threats and vulnerabilities through industry updates and intelligence sharing.
  • Foster a culture of security awareness and training among employees.

By following these recommendations, organizations can significantly improve their ability to withstand and recover from IT disruptions.

The Path Forward: Building a Robust IT Strategy

Both live testing and tabletop testing are vital components of a robust IT system assessment strategy. By leveraging the strengths of each approach, professionals can ensure comprehensive preparedness against potential threats. A balanced combination of live testing and tabletop exercises offers the best of both worlds: realistic assessments of technical performance and strategic evaluations of decision-making processes. This holistic approach is essential for maintaining the security, reliability, and resilience of modern IT systems.

The CrowdStrike incident underscores the fragility of our increasingly interconnected business world. While the immediate focus was on IT systems, the ripple effects extended across various industries. For shippers, the incident highlighted the importance of robust IT infrastructure and disaster recovery planning. Those heavily reliant on APIs for logistics operations might have been particularly vulnerable. It's crucial to maintain a balanced approach, combining API features with on-platform connectivity options and diverse technology solutions to ensure business continuity.

Special thanks to James Green, Michael McLaughlin, Justin Cramer, and Jeffrey Lukaszewski for their expertise.


Dave Salter

Business Development @ ProShip, Inc. | Board Member ASCM-Wisconsin

3 months ago

Thanks for the insightful explanation of real-world testing vs. tabletop testing. As Justin Cramer pointed out, it would help "companies to be more prepared to react and recover".

Justin Cramer

Small Parcel Multi-Carrier Shipping software company co-founder and executive.

3 months ago

It should be noted that neither real-world testing nor tabletop testing would have prevented companies from experiencing the CrowdStrike issues. But it would have made companies more prepared to react and recover in a more professional and reduced-stress manner (note that does not mean stress-free).
