Funding software resiliency in a Hyper-Growth Company - CH01 Motivation
[The entire blog is a work of my personal experiences and opinions. There is no association to my affiliations or my employers.]
Why is software resilience not free?
Any discussion of software resiliency is closely tied to software reliability, because the two concepts are interrelated. In simple terms, software reliability is the practice of making sure the software functions correctly without interruptions, while resiliency is the ability of the software to degrade gracefully during interruptions and recover quickly from outages and issues.
The software development life cycle spans from a basic process like 'writing code and deploying it into the product' to more comprehensive steps like 'gathering requirements, writing code, QA process, load tests, metrics & monitors, rollout/rollback plan, and post-launch monitoring', and beyond.
Every step added along this transition comes with a cost, and in return it can buy additional reliability or resiliency. It is important to understand the cost of each added step so that we can weigh it against the return on investment (ROI) it delivers. Let's walk through a case study.
Case study - Adding unit tests and integration tests: Software doesn't need automated tests to operate; we can still ship it without them. In that case we have two risk acceptance options to choose from: 1) allow a change to break it, which will lead to a bad customer experience (CX), or 2) do manual testing with every release.
What did we save? - X developer time from writing tests.
What are we paying for?
There is no right or wrong approach here. The question is which risk acceptance approach fits the company's priorities at its current stage. For example,
Why does Risk acceptance matter?
Everything comes down to the company's risk appetite and its willingness to accept a given risk. I believe that every bug, infosec issue, privacy concern, and even every step towards reliability and resiliency falls somewhere on the spectrum of risk acceptance.
For example, imagine a critical zero-day bug that a hacker could exploit to gain control of the system. As leaders, we might choose to accept this risk and not prioritize a fix. However, the risk profile of such a scenario can be placed in a 2D space of probability and severity.
So, for an early-stage company (Probability of being exploited: Incredible, Severity: Catastrophic), the bug may still be an acceptable risk. For a unicorn (Probability: Frequent, Severity: Catastrophic), the same bug is an unacceptable risk for the business and can pose an existential threat to the company.
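The two-dimensional view above can be sketched as a small probability-by-severity lookup. This is only an illustration, not a standard: the intermediate level names, the numeric weights, the score thresholds, and the `classify_risk` helper are all assumptions chosen so that the two scenarios described above land where the text places them. A real organization would calibrate its own scale.

```python
from enum import IntEnum


class Probability(IntEnum):
    # Assumed ordinal scale; only INCREDIBLE and FREQUENT appear in the text.
    INCREDIBLE = 1
    IMPROBABLE = 2
    REMOTE = 3
    OCCASIONAL = 4
    PROBABLE = 5
    FREQUENT = 6


class Severity(IntEnum):
    # Assumed ordinal scale; only CATASTROPHIC appears in the text.
    NEGLIGIBLE = 1
    MARGINAL = 2
    CRITICAL = 3
    CATASTROPHIC = 4


def classify_risk(probability: Probability, severity: Severity) -> str:
    """Map a (probability, severity) pair to a risk acceptance band.

    The score is the product of the two ordinal values. The cut-offs
    (4 and 11) are hypothetical and exist only for this sketch.
    """
    score = int(probability) * int(severity)
    if score <= 4:
        return "acceptable"
    if score <= 11:
        return "tolerable"
    return "unacceptable"


# Early-stage company: exploitation is very unlikely, impact is catastrophic.
early_stage = classify_risk(Probability.INCREDIBLE, Severity.CATASTROPHIC)

# Unicorn: exploitation attempts are frequent, impact is catastrophic.
unicorn = classify_risk(Probability.FREQUENT, Severity.CATASTROPHIC)

print(early_stage, unicorn)
```

The point of the sketch is that the same bug, with the same severity, moves between bands purely because the probability of exploitation changes as the company grows.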
To summarize, risk acceptance helps a company separate the fires that can "burn a piece of paper in the bin" from those that can "burn the entire house." In an ideal world, the organization would carry no risks at all, but that requires significant investment. In practical terms, engineering leaders should carefully accept the right risks, keeping the company's resources and stage in mind.
What makes a Hyper-growth company special?
The term hyper-growth first appeared in the Harvard Business Review in 2008. According to the World Economic Forum, compound annual growth rates (CAGR) above 40% define hyper-growth. That is at least double the rate at which a company's growth is considered rapid (20% CAGR), which is itself very fast. (Source)
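To make the 40% threshold concrete, here is a minimal sketch of the usual CAGR arithmetic, (end / start) ** (1 / years) - 1. The function names and the sample figures are made up for illustration; only the 40% and 20% thresholds come from the definition above.

```python
HYPER_GROWTH_CAGR = 0.40  # threshold quoted above
RAPID_GROWTH_CAGR = 0.20  # threshold quoted above


def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate between two values over a number of years."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


def is_hyper_growth(start_value: float, end_value: float, years: float) -> bool:
    """True when the CAGR is above the 40% hyper-growth threshold."""
    return cagr(start_value, end_value, years) > HYPER_GROWTH_CAGR


# Hypothetical example: revenue doubling in two years is roughly 41% CAGR.
print(f"{cagr(100, 200, 2):.1%}", is_hyper_growth(100, 200, 2))
```

Doubling every two years is therefore just over the line, which gives a feel for how demanding the definition is.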
These companies are special because they are transforming from a company that "nobody cares about" into a company where "an outage causes a public outcry". Their leaders may realize that "move fast and break things" is not going to work anymore. To build long-lasting user trust, they need to release new software with very high confidence.
Leadership may be vulnerable to complacency, assuming that what brought them to this point will also carry them forward. In other words, let's keep adding user value with more functionality and features. However, the user base is now huge and there are millions of eyes on every release, especially if it is a public company. The engineering organization is large, and the company is moving at top speed to acquire new users and deliver great features.
Incidents are inevitable; even gold-standard AWS services face incidents. Incidents mainly fall under two categories, human error or process failure, and can be further classified as infrastructure, third-party, or software-level incidents. So, it is common for a hyper-growth company to prioritize putting new processes in place to ship reliable and resilient software. The company and its leaders also realize that this will slow down the pace of engineering if the number of resources stays the same.
This raises the question of risk acceptance and finding the right balance between the new processes and controls in place and the benefits they offer in order to manage the risks.
Final words
We've completed the motivational part in this first chapter of the series. In future chapters, we will continue the discussion about funding resiliency efforts, where we should start, and how we measure success.
Feel free to drop your thoughts and comments below. Until next time, signing off.