Value of agility vs the impact of resilience?
Context
In recent years, many businesses have leveraged the scalable capabilities provided by hyperscaler platform services. The benefits are numerous: businesses can avoid the complexities of building and managing data centers, infrastructure, and various technical services such as security and networking. Technologies offering automated toolchains, cloud-based email, and cloud-based CRM enable direct consumption via internet access, allowing organizations to utilize these services without needing in-house technical expertise.
Aggregating these needs across institutions allows hyperscalers to offer even more scaled services, reducing marginal costs and employing talent to improve efficiency, productivity, and scaling capabilities seamlessly while keeping technologies current and secure.
Agility and Resilience
Organizations can focus on their core activities, achieving more iterative ("agile") development and delivering new features to employees and customers, thereby driving business outcomes. This sounds ideal, but what could be wrong with this picture? Or is there something missing?
The consolidation and scale provided by these platforms result in increased concentration. Information within an organization often passes through multiple applications and platforms (whether in-house, internet-based, or third-party). Scaled platforms can operate with six nines uptime (99.9999%), meaning downtime of only 0.0001% of the time, which is about 32 seconds in a whole year; even five nines allows only about 5 minutes. The challenge arises when organizations must respond to unplanned downtime. Have they anticipated it sufficiently? In modeling and risk management, this could be categorized as a "black swan" event, meaning rare but significant.
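To make the "nines" arithmetic concrete, here is a minimal sketch that converts an availability level into an expected yearly downtime budget. The function name and the 365-day year are illustrative assumptions, not a standard definition.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year (no leap day)


def downtime_seconds_per_year(nines: int) -> float:
    """Yearly downtime budget for an availability of `nines` nines.

    Example: nines=5 means 99.999% uptime, i.e. 0.001% downtime.
    """
    unavailability = 10.0 ** (-nines)  # fraction of the year spent down
    return SECONDS_PER_YEAR * unavailability


# Print the budget for four, five, and six nines.
for n in (4, 5, 6):
    secs = downtime_seconds_per_year(n)
    print(f"{n} nines: {secs:.1f} seconds/year ({secs / 60:.2f} minutes)")
```

Under these assumptions, four nines comes to roughly 53 minutes a year, five nines to roughly 5 minutes, and six nines to roughly half a minute.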
Industries with high resilience expectations, such as banking (ATM withdrawals) or emergency services (mobile calls to emergency numbers), have robust systems in place to handle such events. These sectors ensure resilience because dedicated teams focus on it continually.
The Challenges
The new paradigm of leveraging platform services creates challenges due to the layering of inter-dependent technologies. This layering can occur in series (e.g., a mobile app communicating with a server that accesses cloud-based database and CRM services) or in stacked applications (e.g., an application running on the cloud using other cloud-based authentication services).
Organizations often lack full visibility or tracking of all services used by the end-platform or user application. When some platform layers degrade in performance or face uptime challenges, the critical requirement is to triage the impact, understand how it affects the business, and implement incident responses to recover as quickly as possible. This requires protocols to understand operational expectations from third parties (who may be inundated with client requests) and the ability to implement temporary workarounds at scale to manage business disruptions.
Many organizations have some level of preparedness, often tested periodically. But how much attention is given to what might be mathematically categorized as a "black swan" event?
The Trade-offs and Questions for Organizations
The key question is: Is this really a black swan event, given the sheer number of third-party solutions used daily? Given the stacking of applications end-to-end or on top of each other, is the cumulative likelihood still a black swan, or is it more frequent?
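One way to reason about the cumulative likelihood is to multiply the availabilities of services that sit in series. The sketch below assumes failures are independent and that the chain is down whenever any one dependency is down. Both are simplifying assumptions, and the ten-service chain at five nines each is a hypothetical example, not data from any real platform.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def series_availability(availabilities):
    """Availability of a chain where every dependency must be up.

    Assumes independent failures, so the individual values multiply.
    """
    return prod(availabilities)


def downtime_minutes_per_year(availability):
    """Expected yearly downtime, in minutes, for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Hypothetical chain: ten dependencies, each at five nines (99.999%).
chain = [0.99999] * 10
combined = series_availability(chain)

print(f"single service: {downtime_minutes_per_year(0.99999):.1f} minutes/year")
print(f"chain of {len(chain)}: {downtime_minutes_per_year(combined):.1f} minutes/year")
```

In this illustration, each service on its own is down about 5 minutes a year, but the chain as a whole is down about 52.6 minutes a year, which is closer to four nines. Stacking enough dependencies makes the rare event noticeably less rare.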
Organizations need to know the procedures to execute and whether they can switch to alternatives if downtime exceeds the designed thresholds of 30 seconds, 5 minutes, or 50 minutes (roughly the yearly downtime budgets of six, five, and four nines respectively).
Are you accepting too much resilience risk (treated as a low-probability event) in exchange for agility, which is perceived to be more important?
Other questions to consider:
By addressing these questions, organizations can better prepare for potential disruptions and ensure they are not overly reliant on the perceived improbability of "black swan" events.
I have been planning to post on this for quite some time, but the issues the world is facing this morning gave me the impetus to put fingers to keyboard.
In case you are not sure what is going on today (19th July, AM Universal Time): there has been a major cloud-related outage at a hyperscaler, and it looks like anyone who has plugged into those elements on the cloud has had outages, latency issues, and the like.