Resiliency: when shared responsibility goes wrong
The CrowdStrike-Microsoft saga is still unfolding, and the fallout and associated cost have yet to be calculated. Windows is ubiquitous, and it runs a majority of consumer-facing platforms: displays, reservation systems, registers and more. Had the outage been isolated to back-office platforms, the fallout might have stayed off the news and been quietly "swept under the rug." But with customer-facing platforms impacted, the dreaded BSOD was out there for the world to see. The world saw it, the world felt the pain, and it absolutely does not want to relive it. Security teams around the world were dragged into the proverbial "principal's office" to explain why a non-security incident tied to a security product resulted in business downtime.
Did we see this coming? Well, yes, in a way. Is this a risk we can mitigate? Let's walk through that.
Ignoring the irony of it all, the current situation requires corporate teams to take control of, and find ways to mitigate, the shared-responsibility risks to our business operations. Leadership teams will be talking about tech resiliency and business continuity plans. Insurance teams are going to assess the insured for impact, recovery time and revenue loss in their next assessment. Get ready for some very serious risk-tolerance decisions. I recently wrote an article on shared security responsibility struggles. That article focused on how teams could manage shared responsibilities when working with multiple security teams that have varying security goals and follow distinct policies and standards. That guidance was designed to help you pick the right partner: a partner that takes security seriously and a partner you could trust your business with. It is indeed true that trust is earned in drops but lost in buckets. Is it time we TRUST, but VERIFY, the processes our partners have deployed? Is it time we think creatively about how our share of the "shared responsibility model" can compensate for the risks introduced by our partners?
Understanding the partner's CI/CD process: One of the cardinal rules of pushing system updates is the ability to test the finished product on a sample subset of machines. Canary testing is something we do routinely when we patch for vulnerabilities: push the update to a sample group, test and validate that there are no issues, and then roll it out to the rest of the community. Naturally, this helps contain any issues impacting mission-critical systems and reduces the blast radius of a rogue update. Understanding the vendor's testing process, and having the ability to test the final product on a sample group and validate the outcome before pushing it to the rest of the environment, is a sure way to avoid unexpected surprises. A minimal sketch of that staged-rollout idea appears below.
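To make the staged-rollout concept concrete, here is a minimal sketch of a ring-based deployment gate. It is illustrative only: the ring names, the health-check function and the thresholds are hypothetical, and in practice this logic would sit on top of whatever endpoint-management tooling your environment actually uses.

```python
import time

# Hypothetical deployment rings, ordered from smallest blast radius to largest.
RINGS = [
    {"name": "canary",      "hosts": ["lab-01", "lab-02"]},
    {"name": "early-adopt", "hosts": ["store-101", "office-kiosk-7"]},
    {"name": "broad",       "hosts": ["all-remaining-endpoints"]},
]

HEALTH_THRESHOLD = 0.98  # minimum fraction of healthy hosts required to proceed
SOAK_TIME_SECONDS = 5    # kept short for this sketch; hours in real life


def push_update(hosts, package):
    """Placeholder: hand the package to your real endpoint-management tool."""
    print(f"Pushing {package} to {len(hosts)} host group(s): {hosts}")


def ring_health(hosts):
    """Placeholder: query real telemetry (heartbeats, boot status, app checks)."""
    return 1.0  # assume healthy in this sketch


def staged_rollout(package):
    for ring in RINGS:
        push_update(ring["hosts"], package)
        time.sleep(SOAK_TIME_SECONDS)  # let the ring soak before judging it
        score = ring_health(ring["hosts"])
        if score < HEALTH_THRESHOLD:
            print(f"Halting rollout: ring '{ring['name']}' health {score:.2%}")
            return False  # stop before the blast radius grows
        print(f"Ring '{ring['name']}' healthy ({score:.2%}); continuing")
    return True


if __name__ == "__main__":
    staged_rollout("vendor-sensor-update-1.2.3")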
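The point is not the code itself but the gate: no ring receives the update until the previous ring has soaked and reported healthy, regardless of whether the update comes from your own team or a vendor's channel.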
Fast detection, decisive response and recovery plans: What happened Friday was a tech resiliency challenge introduced by a faulty security solution. Could we have prepared for this? In a way, this is essentially what security teams have been preparing for since the ransomware scare. The many tabletop exercises, data backup and tech resiliency plans were all done in preparation for malware "bricking" your systems and making off with your data. In this case, the outcome was essentially the same, and your response plans would work just as well in this scenario. But what if this had NOT been a security tool causing grief; could our processes adapt? Detecting the incident was a challenge. Global organizations had an advantage here: regional teams identified and escalated the incident to the right level. But for others localized to the US, this was a rude wake-up call. The IR teams responded and identified the issue, but did not have much of a role after the initial containment and recovery directives. The core of the recovery process required multiple operations teams from across the enterprise working in a coordinated effort to bring systems back online. It also required many technicians to visit offices, stores and other spaces to manually restore devices. With the large swath of impacted businesses, tech teams were stretched thin, and larger contracts and retainers took precedence. Having an incident response plan in place, with a team that can coordinate efforts across the operational environments, is an absolute requirement. The team should have the ability to stand up a crisis management team, operate from a well-documented recovery plan and bring to bear the right operations teams. In this case, fast detection and response were NOT enough to reduce business downtime; executing the recovery plan was imperative. A small sketch of one early-detection signal follows.
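One inexpensive early-warning signal, offered here as a hedged sketch rather than any prescribed detection method, is simply watching for an abnormal number of endpoints going silent at the same time. The telemetry source, window size, fleet size and threshold below are all assumptions; in practice this would be fed by your real monitoring or EDR heartbeat data.

```python
from collections import deque
from datetime import datetime, timedelta

# Hypothetical thresholds: tune to your fleet size and normal churn.
WINDOW = timedelta(minutes=10)
MASS_OFFLINE_THRESHOLD = 0.05  # alert if more than 5% of the fleet goes silent in the window


class HeartbeatMonitor:
    """Tracks endpoints that stop reporting and flags sudden mass-offline events."""

    def __init__(self, fleet_size):
        self.fleet_size = fleet_size
        self.offline_events = deque()  # timestamps of hosts that went silent

    def record_offline(self, hostname, when=None):
        when = when or datetime.utcnow()
        self.offline_events.append(when)
        self._trim(when)

    def _trim(self, now):
        while self.offline_events and now - self.offline_events[0] > WINDOW:
            self.offline_events.popleft()

    def mass_offline(self):
        """True if the fraction of recently silent hosts exceeds the threshold."""
        return len(self.offline_events) / self.fleet_size > MASS_OFFLINE_THRESHOLD


# Usage sketch with a deliberately small, hypothetical fleet.
monitor = HeartbeatMonitor(fleet_size=40)
for host in ["pos-0001", "pos-0002", "kiosk-0147"]:  # hypothetical host names
    monitor.record_offline(host)
if monitor.mass_offline():
    print("Possible fleet-wide failure: page the crisis management team")
```

A signal like this does not diagnose the cause; its only job is to get the crisis management team assembled minutes, not hours, after a fleet-wide failure begins.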
Diversifying your tech stack: Putting all your "eggs in one basket" has always been frowned upon by the security community. But with businesses looking for economies of scale and "betting big with big partners", there has been reduced diversity in our tech landscape. It is a balancing act: the cost savings realized by going all-in with one partner may be outweighed by the possibility of unacceptable business downtime. The lack of diversity in your tech stack should factor into the acceptable risk your business is willing to take in the event of a total failure. There are various opportunities to increase tech diversity within your enterprise, especially in the revenue-generating segments of your business. Additionally, requiring your partners to share their resiliency plans, and making sure they meet your standards for tech resiliency, can go a long way in maintaining uptime.
Can we live on a browser? With businesses driving towards a SaaS consumption model, we no longer need high-powered user laptops with large hard drives. Many of our daily apps are delivered as SaaS, available via the browser and accessed from something as small as an iPad or a Chromebook. With identity providers enabling easy-to-use application landing pages, combined with cloud-based data storage, the consumer experience is rather frictionless. There are mature enterprise browsers that can effectively impose the same level of control, access and usability as you would have with your laptops. And in the event of a disaster, with the right security policies in place, even BYOD devices can be recruited to hold the fort while we recover company-owned and managed systems. A simple sketch of that kind of policy gate is shown below.
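As a purely illustrative sketch, the BYOD fallback described above boils down to a policy gate: before a personal device is allowed to reach browser-delivered apps during an outage, it must meet a minimum posture bar. The attributes, version floors and helper names here are assumptions, not any specific identity provider's or enterprise browser's API.

```python
from dataclasses import dataclass

# Hypothetical minimum OS versions for emergency BYOD access via the browser.
REQUIRED_OS_VERSIONS = {"macOS": "13.0", "Windows": "10.0.19045", "ChromeOS": "120"}


@dataclass
class DevicePosture:
    os_name: str
    os_version: str
    screen_lock_enabled: bool
    disk_encrypted: bool
    mfa_enrolled: bool


def allow_emergency_browser_access(device: DevicePosture) -> bool:
    """Grant browser-only access during an outage if the device meets a minimum bar."""
    minimum = REQUIRED_OS_VERSIONS.get(device.os_name)
    if minimum is None:
        return False  # unknown platform: deny by default
    version_ok = tuple(map(int, device.os_version.split("."))) >= tuple(map(int, minimum.split(".")))
    return version_ok and device.screen_lock_enabled and device.disk_encrypted and device.mfa_enrolled


# Usage sketch with a hypothetical personal laptop.
byod = DevicePosture("macOS", "14.2", screen_lock_enabled=True, disk_encrypted=True, mfa_enrolled=True)
print("Emergency access granted" if allow_emergency_browser_access(byod) else "Access denied")
```

The design choice worth noting is "deny by default": during a crisis, the temptation is to open the doors wide, and a pre-agreed minimum bar keeps the emergency workaround from becoming the next incident.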
In closing, we must accept the reality that businesses worldwide rely on third-party solutions to enhance, optimize and protect their business operations. In the process, we are beholden to the concept of shared responsibility. While we may not have visibility into, or control over, our partners' processes and operational rhythms, we can absolutely deploy capabilities that mitigate risk to our own business operations.