How “resilience as code” empowers continuous reliability in complex systems
Ganesh Seetharaman
GLOBAL IT Executive Results Driven Leader: CONSULTING PARTNER | PRACTICE OFFERING LEAD | CIO Advisor | CTO Advisor
Major outages have shown how complex tech stacks and an interconnected world can put companies, and consumers, in a fraught position. Here’s how a continuous approach to resilience, built on automation, can protect against disaster.
—
Financial services platforms that go dark during critical trading hours.[1] Point-of-sale systems that leave businesses unable to process transactions.[2] Software updates that cause entire swaths of the internet to shut down and ground thousands of flights in the process.[3] Recent high-profile outages have cast a spotlight on the fragility of our increasingly interconnected digital ecosystems, and the ripple effects of downtime are far-reaching and costly. Industry research suggests that downtime can cost organizations up to $5 million per hour,[4] a figure that doesn't account for long-term brand damage and customer churn.
What’s happening? For one, modern tech stacks are growing more complex. As firms leverage a mix of on-premises infrastructure, cloud services, and third-party solutions, they're creating intricate webs of dependencies that can be challenging to manage and monitor effectively.[5] The complexity is compounded by distributed computing architectures that span multiple environments, rapid release cycles that introduce frequent changes, increasing reliance on third-party services and APIs, and the growing sophistication of cyber threats.
The “always-on” imperative
Traditional approaches to resilience, often reactive and siloed, no longer work. Market and customer expectations, shaped by instant gratification, are shifting towards an "always-on" mindset.[6] This shift is putting pressure on executives responsible for operational stability: they’re managing increasingly complex systems while helping ensure the uninterrupted delivery of mission-critical digital services.
Think about a retail scenario where a customer wants to make a purchase. They expect an end-to-end view of the process, a personalized experience based on their history, and optimal pricing—at all times. To deliver this, companies should build robust solutions that integrate internal systems with third-party services. But this integration introduces new vulnerabilities.
If a cloud provider or a security service experiences an outage, the business faces issues, and the impact cascades down to the end customer, potentially resulting in lost sales and damaged brand reputation.[7] This scenario isn't unique to retail; industries such as financial services, healthcare, and insurance face their own versions of this challenge.
Defining “resilience as code”
Companies need a different approach to continuous reliability. This is where the concept of "resilience as code" comes into play.
It embeds reliability and recovery capabilities directly into the software development lifecycle. It also leverages automation, continuous testing, and real-time monitoring to create systems that are inherently more robust and adaptable to change.[8] Resilience as code is about automating everything through code, not just for single stages in the development lifecycle, but across the entire process. This has the added benefit of reducing the cognitive overload and fatigue plaguing today's developers and operations teams.
Resilience as code standardizes reliability measures across the technology stack, proactively identifies and mitigates potential failure points, and enables rapid, controlled recovery from incidents. By codifying resilience, organizations can move from a reactive stance to a proactive one, where potential issues are identified and addressed before they impact customers or business operations.
This approach helps uncover potential conflicts or architectural trade-offs early in the development process. It addresses the challenges of a volatile, uncertain, complex, and ambiguous (VUCA) environment,[9] managing the conflict between stability and resilience in an automated fashion. By doing so, it allows organizations to maintain customer stickiness, protect their reputation, and help ensure agility and time-to-market without compromising on stability and resilience.
The five tenets of resilience as code
To implement resilience as code effectively, companies need to focus on five key tenets.
Fitness functions for architectural integrity. These automated checks help ensure a system adheres to its intended architectural principles, acting as guardrails that prevent drift and maintain overall system health. They offer key benefits such as proactively managing technical debt, ensuring consistency across distributed systems, and providing early warning of potential architectural issues.[10]
Fitness functions capture or codify both functional and non-functional requirements for an architecture or solution, serving as a metric for product or platform teams to manage drift, sprawl, and technical debt—thereby maintaining high solution hygiene.
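As a rough illustration of the idea above, a fitness function can be an ordinary automated check that fails when the architecture drifts. The sketch below pairs one structural rule with one non-functional rule. The layer names, the `ALLOWED_DEPENDENCIES` table, and the latency budget are hypothetical and chosen only for the example; a real team would derive them from its own architecture.

```python
import math
from dataclasses import dataclass

# Hypothetical layering rule: each layer may depend only on the layers listed.
ALLOWED_DEPENDENCIES = {
    "ui": {"service"},
    "service": {"domain", "persistence"},
    "persistence": {"domain"},
    "domain": set(),
}


@dataclass(frozen=True)
class Violation:
    source: str
    target: str


def layering_violations(dependencies, allowed=ALLOWED_DEPENDENCIES):
    """Return every dependency edge that the layering rule forbids."""
    violations = []
    for source in sorted(dependencies):
        permitted = allowed.get(source, set())
        for target in sorted(dependencies[source]):
            if target != source and target not in permitted:
                violations.append(Violation(source, target))
    return violations


def latency_within_budget(samples_ms, budget_ms, percentile=95):
    """Non-functional check: the nearest-rank percentile must not exceed the budget."""
    if not samples_ms:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(samples_ms)
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[max(rank, 1) - 1] <= budget_ms
```

Run on every build, a non-empty result from `layering_violations` or a `False` from `latency_within_budget` gives a team an early, objective signal of drift, rather than a finding that surfaces later in an architecture review.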
The adoption of comprehensive metrics. Effective resilience requires a clear understanding of system performance and reliability. By adopting a comprehensive set of metrics, organizations can gain visibility into all aspects of their technology stack. This includes establishing rollout strategies for more frequent releases, reliability scores for solutions, and measurements for unknown factors.
These metrics directly or indirectly support measures such as mean time to triage, repair, and recover, as well as early detection of problems.[11] They also improve visibility within solutions, making problems easier to explain when issues arise. A decline in scores or metrics signals to solution leaders where to address coverage gaps, blind spots, or potential landmines in their systems.
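To make the metrics above concrete, here is a minimal sketch of how mean time to detect, mean time to recover, and a simple availability score might be derived from incident records. The `Incident` record and its field names are assumptions made for illustration, and the availability calculation assumes incidents do not overlap.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    started: datetime   # when the fault began
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when service was restored


def _mean(durations):
    """Average a non-empty list of timedeltas."""
    return sum(durations, timedelta()) / len(durations)


def mean_time_to_detect(incidents):
    return _mean([i.detected - i.started for i in incidents])


def mean_time_to_recover(incidents):
    return _mean([i.resolved - i.started for i in incidents])


def availability(incidents, window):
    """Fraction of the window without an active incident (no overlaps assumed)."""
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    return 1 - downtime / window
```

Tracking these values per release makes a decline visible as a trend, which is the signal the tenet relies on.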
Controlling blast radiuses and minimizing attack surfaces. In an era of increasing cyber threats, limiting the potential impact of any single failure or breach is crucial. This tenet focuses on creating boundaries and isolation zones within systems. It involves implementing strategies that contain a failure to specific functions or services, so that the effect on user journeys and surrounding services stays limited.[12]
The objective is to perform trade-off analysis, avoid bias, and enable graceful shutdown or failure. This approach helps protect customers, prevent revenue loss, and control the situation without necessarily entering full recovery mode.
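One common way to contain a failure and degrade gracefully is a circuit breaker. The article does not name a specific technique, so the sketch below is offered as one possible illustration, not as the approach the tenet prescribes. The class name, thresholds, and injectable `clock` parameter are all assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Stops calling a failing dependency and serves a fallback instead."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def is_open(self):
        return self._opened_at is not None

    def call(self, operation, fallback):
        if self.is_open and self._clock() - self._opened_at < self._reset_after_s:
            # Still cooling down: skip the dependency and degrade gracefully.
            return fallback()
        try:
            # Closed circuit, or cool-down elapsed and this is a trial call.
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

In a retail journey, for instance, `operation` might call a personalization service while `fallback` returns a generic recommendation list, so checkout keeps working while the failing dependency is isolated.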
Chaos testing for unknown unknowns. Inspired by Chaos Monkey,[13] chaos testing involves deliberately introducing controlled failures into a system to uncover weaknesses and improve overall resilience. It's about orchestrating and creating scenarios for unknown unknowns, simulating disruptions within the environment before releasing to production.
This practice should be integrated across the entire business journey, covering applications, data, infrastructure, and processes. By bringing abstraction across all areas and tying it back to established metrics, organizations can better control blast radius and enhance overall system resilience.
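A chaos experiment can start very small. The sketch below wraps a dependency so that it fails at a controlled rate, then measures whether a retry-with-default strategy still serves requests. The function names, the price lookup, and the failure rates are hypothetical, and a seeded random generator keeps each run repeatable. Dedicated chaos tooling works at the infrastructure level; this only illustrates the principle in application code.

```python
import random


class InjectedFault(Exception):
    """Raised by the chaos wrapper to simulate a dependency failure."""


def with_chaos(func, failure_rate, rng):
    """Wrap func so that a fraction of calls fail with an injected fault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency failure")
        return func(*args, **kwargs)
    return wrapper


def fetch_with_retry(fetch, key, attempts=3, default=None):
    """Resilience measure under test: retry, then fall back to a default."""
    for _ in range(attempts):
        try:
            return fetch(key)
        except Exception:
            continue
    return default


def run_experiment(failure_rate, requests=1000, seed=7):
    """Return the share of requests served while faults are being injected."""
    rng = random.Random(seed)
    flaky_lookup = with_chaos(lambda sku: 9.99, failure_rate, rng)
    served = sum(
        fetch_with_retry(flaky_lookup, "sku-123") is not None
        for _ in range(requests)
    )
    return served / requests
```

The value returned by `run_experiment` can be compared with an agreed threshold, which ties the experiment back to the established metrics.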
Integration with CI/CD pipelines. To truly embed resilience into the development process, reliability checks and tests must be an integral part of the continuous integration and delivery pipeline. This integration enables continuous resilience by incorporating the codified thresholds, metrics, and other elements from the previous tenets into the development lifecycle. It provides coverage, including experimental coverage, allowing for the constant introduction of new testing scenarios.[14]
This approach facilitates change discipline, increases the number of features that can be released in a controlled fashion (minimizing impact), and helps control sprawl and technical debt. The result is improved standardization, easier problem-solving due to better insights, and a more streamlined approach to finding and addressing issues within complex systems.
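In a pipeline, the checks from the earlier tenets can be reduced to a gate that passes or blocks a release. The following sketch assumes a hypothetical set of metric names and thresholds; a real pipeline would load them from its own configuration and measurement tooling.

```python
# Hypothetical release thresholds: metric name -> (comparison, limit).
THRESHOLDS = {
    "availability": (">=", 0.999),
    "p95_latency_ms": ("<=", 300),
    "chaos_success_rate": (">=", 0.95),
}


def evaluate_gate(metrics, thresholds=THRESHOLDS):
    """Return a human-readable message for every threshold that is not met."""
    failures = []
    for name, (comparison, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: no measurement reported")
        elif comparison == ">=" and value < limit:
            failures.append(f"{name}: {value} is below the minimum {limit}")
        elif comparison == "<=" and value > limit:
            failures.append(f"{name}: {value} is above the maximum {limit}")
    return failures


def gate_exit_code(metrics, thresholds=THRESHOLDS):
    """Return 0 when the release may proceed and 1 when it should be blocked."""
    failures = evaluate_gate(metrics, thresholds)
    for message in failures:
        print(f"RESILIENCE GATE FAILED: {message}")
    return 1 if failures else 0
```

A pipeline step could pass the result of `gate_exit_code` to `sys.exit`, so a non-zero value stops the release. Treating a missing measurement as a failure is a deliberate choice in this sketch, since a gap in coverage is itself a finding.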
Overcoming implementation challenges
While the potential benefits of resilience as code are clear, implementation can present several challenges.
These include subjectivity in metrics, where agreeing on what constitutes "good enough" reliability can be difficult across different teams and stakeholders. Balancing trade-offs is another challenge, as there's often tension between speed of delivery and robust testing. Finding the right balance requires ongoing dialogue and adjustment. Finally, evolving requirements pose a challenge, as resilience strategies should adapt to changing business needs, requiring a flexible, iterative approach.
To address these challenges, organizations need to establish a common language and taxonomy around resilience. This common decision-making language for discussions helps define shared objectives and integrates change management into the development lifecycle. Leaders should also foster a culture of shared responsibility for reliability and implement objective decision-making frameworks for resolving conflicts.
The path forward
As "always-on" expectations rise, companies must evolve their approach to resilience. Resilience as code provides a framework for this shift, embedding reliability throughout the development lifecycle and addressing regulatory compliance. With stringent regulatory requirements such as DORA in the EU and those of the OCC and HIPAA in the US, organizations should provide solid evidence of their resilience capabilities. Automation plays a crucial role in creating a unified approach that mitigates both technical and human risks effectively. Under DORA, failure to comply could result in significant fines, set as a percentage of total annual worldwide revenues.[15]
To move forward, organizations should implement resilience principles early, enabling continuous improvement and early issue detection. Adopting scenario-based risk assessments and embracing an iterative approach are key, recognizing that continuous adaptation is necessary in an environment controlled by external factors. This approach provides a significant leap forward in designing intricate systems that can withstand the pressures of modern digital business environments while meeting regulatory requirements.
Ready to take the next step?
Deloitte's holistic framework addresses modern system resilience complexities, from architecture to compliance. We can guide you in implementing resilience as code, helping you meet regulatory requirements and business objectives. Our specialists will assess your current posture, identify improvement areas, and implement solutions that drive measurable business value.
Don't wait for vulnerabilities to be exposed or for regulatory fines to impact your bottom line. Contact us to build a robust, compliant technology foundation that meets today's demands and aligns with your business goals.
This article is part of a series on technology resilience. For further insights on building resilient operations, stay tuned for the next piece in the series.
This blog post contains general information only and Deloitte is not, by means of this blog post, rendering accounting, business, financial, investment, legal, tax, or other professional advice or services. This blog post is not a substitute for such professional advice or services, nor should it be used as a basis for any decision or action that may affect your business. Before making any decision or taking any action that may affect your business, you should consult a qualified professional advisor.
Deloitte shall not be responsible for any loss sustained by any person who relies on this blog post. As used in this document, “Deloitte” means Deloitte Consulting LLP, a subsidiary of Deloitte LLP. Please see www.deloitte.com/us/about for a detailed description of our legal structure. Certain services may not be available to attest clients under the rules and regulations of public accounting.
[1] Kaye, Danielle, “Thousands of Users Have Reported Online Brokerage Outages as Stocks Tumbled.” The New York Times, August 5th, 2024.
[2] Page, Carly, “Square Says It Has Resolved Daylong Outage.” TechCrunch, September 8th, 2023.
[3] Josephs, Leslie, and Yildirim, Ece, “Delta CEO Says CrowdStrike-Microsoft Outage Cost the Airline $500 Million.” CNBC, July 31st, 2024.
[4] Shepherd, Daniel, “Why DNS Exploits Continue to Be a Top Attack Vector in 2024.” TahawulTech.com, March 18th, 2024.
[5] O’Brien, Matt, “One Faulty CrowdStrike Update Caused a Global Outage.” AP News, July 19th, 2024.
[6] Vollero, Agostino, et al., “Exploring the Role of the Amazon Effect on Customer Expectations: An Analysis of User‐Generated Content in Consumer Electronics Retailing.” Journal of Consumer Behaviour, June 29th, 2021.
[7] Lim, Shawn, et al., “What the Microsoft-CrowdStrike Outage Means for Brand Reputation.” PR Week, July 22nd, 2024.
[8] “Resilient Digital Operations.” Deloitte, 2023.
[9] Loucks, Jeff, “Leading in a VUCA World.” Deloitte, October 23rd, 2019.
[10] Paul, Paula, and Rosemary Wang, “Fitness Function-Driven Development.” Thoughtworks, January 11th, 2019.
[11] “Choose Your Service Level Indicators.” Google Cloud, March 29th, 2024.
[12] “Ransomware: Threat Activities, Trends, and Continuing Evolution.” Deloitte, July 17th, 2023.
[13] Netflix Technology Blog, “Netflix Chaos Monkey Upgraded.” Medium, March 24th, 2018.
[14] “Why Architecting for Resilience Is More than a Technology Concern.” Deloitte, 2023.
[15] “IT Regulatory Compliance.” Deloitte.