Single Points of Failure
Kenneth Igiri
Enterprise Architect | Enabling Long-Term Business-Tech Alignment with Architecture & Strategy Tools
This past week was one of those weeks when a certain theme kept staring me in the face as a rising professional. That theme is redundancy! The objective of redundancy in system design is typically to achieve high availability and/or disaster recovery.
The first challenge this week that brought these terms to the fore was the choice of availability zones versus regions as mechanisms for addressing availability concerns. The second was what I might term "boundaries of control" in the quest to keep systems "up", specifically the widespread unavailability of Internet access in some African countries.
Do We Really Need Multiple Regions?
In a conversation with respected service providers, I argued that deploying two regions for cloud infrastructure on Amazon Web Services (AWS) or another cloud service provider should start with a conversation about what availability requirements we are actually trying to achieve. In essence, my position is this: when an entire CSP region fails, it makes the news precisely because it doesn't happen often enough to be a grave concern for most organizations. Incurring the cost and embracing the complexity of setting up alternate regions in such a way that it actually works should be reserved for the most stringent availability requirements.
Availability Requirements
This argument is another stab at establishing first what it is we are trying to achieve. 99.9%? 99.99%? 99.999%? Brent Ozar, a respected senior professional in the data management space, once said in a training that when a client wants high availability, the best approach is to offer options along with the costs associated with those options. This forces the requester to think deeply about what they are asking for. I thought then, and still think now, that it makes sense. Start with what is required.
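To make those targets easier to debate, a rough sketch helps. The snippet below (plain Python; the targets are common examples, not figures from any specific client) converts an availability percentage into the downtime it actually allows per year, which is usually the number executives need to see next to the price tag.

```python
# Convert an availability target into the downtime it allows per year.
# Illustrative only; the targets below are common examples, not requirements
# from any particular client or provider.

MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_budget(availability_pct: float) -> float:
    """Return the allowed downtime per year, in minutes."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for target in (99.9, 99.99, 99.999):
    minutes = downtime_budget(target)
    print(f"{target}% availability -> about {minutes:,.0f} minutes "
          f"({minutes / 60:.1f} hours) of downtime per year")

# 99.9%   -> ~526 minutes (~8.8 hours) per year
# 99.99%  -> ~53 minutes per year
# 99.999% -> ~5 minutes per year
```

Seeing 8.8 hours versus 5 minutes of allowed downtime per year makes the cost conversation far more concrete than comparing strings of nines.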
Many IT organizations spend enormous amounts of money, time and effort architecting and building resilience that is hardly ever used, infrequently tested, and often unsuccessful when needed. This may be traceable to the fact that points of failure are not properly analysed and that some causes of failure are simply outside our "boundaries of control". Recall the cloud outages of 2020 and 2021 attributed to DNS and Active Directory (AD).
AZs Might Just Be Sufficient
My opinion about most cloud deployments on the well-known cloud service providers is that using availability zones for redundancy is likely to address the majority of failure scenarios. Why maintain passive infrastructure in alternate regions for solutions that do not require five nines? I would understand such effort and cost for directory services, payment systems, telemedicine and other mission-critical solutions, but others should be evaluated more deeply.
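For many workloads, spreading a resource across availability zones is a configuration flag rather than a whole second deployment. As a hypothetical sketch (boto3 against AWS RDS; the identifiers, engine and sizes are placeholders I made up, not a recommendation), a Multi-AZ database gives you a synchronous standby in another zone without any cross-region architecture:

```python
# Hypothetical sketch: provisioning a Multi-AZ RDS instance with boto3.
# Identifiers, engine and sizes are placeholders, not a recommendation.
import boto3

rds = boto3.client("rds", region_name="eu-west-1")

rds.create_db_instance(
    DBInstanceIdentifier="orders-db",       # placeholder name
    DBInstanceClass="db.t3.medium",
    Engine="postgres",
    AllocatedStorage=100,                   # GiB
    MasterUsername="dbadmin",
    MasterUserPassword="change-me",         # use a secrets manager in practice
    MultiAZ=True,                           # standby replica in another AZ
)
# With MultiAZ=True, the provider keeps a standby in a second availability
# zone and fails over automatically; no second region is involved.
```

The point of the sketch is the contrast in effort: one flag for zone-level redundancy versus an entire replicated estate, runbooks and failover testing for a second region.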
Most Importantly, Does It Work?
We may throw three copies of each component at a solution and still not meet the need from a service perspective. Many IT leaders end up explaining to executives why it didn't work. The bottom line is that it may not always be possible to cater for every failure scenario, and we shouldn't promise anyone that we can. Even if we can, should we?
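A back-of-the-envelope calculation shows why. In the sketch below (plain Python, with made-up component figures), redundant copies multiply out beautifully in parallel, but a single serial dependency we do not control, such as DNS or an undersea cable, caps the availability of the whole chain.

```python
# Back-of-the-envelope availability math with illustrative (made-up) figures.

def parallel(a: float, copies: int) -> float:
    """Availability of N independent, redundant copies of a component."""
    return 1 - (1 - a) ** copies

def serial(*availabilities: float) -> float:
    """Availability of a chain where every component must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

app = parallel(0.99, copies=3)    # three copies of a 99% component -> 99.9999%
db = parallel(0.995, copies=2)    # two copies of a 99.5% component
single_cable = 0.98               # one dependency outside our control, no redundancy

end_to_end = serial(app, db, single_cable)
print(f"app tier:   {app:.6f}")
print(f"db tier:    {db:.6f}")
print(f"end to end: {end_to_end:.6f}")  # dominated by the 98% single point of failure
```

Tripling every component we own still leaves the service at roughly 98% end to end, because the weakest serial link sets the ceiling.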
So, What Really Happened with Africa's Internet?
A couple of days after I took my confident position on the reliability of the cloud, almost like a response from "principalities and powers", most of us lost Internet access. It is now common knowledge that the failure of Internet services in a few African countries was caused by the failure of four undersea fibre-optic cables. Jess Auerbach Jahajeeah, in her interview with The Conversation U.S., pointed out that while these systems are expected to be built for redundancy, some countries have only one cable terminating on their borders.
This means that a startup in Lagos, Accra or Freetown leveraging the cloud will still suffer an outage, from the customer's perspective, as a result of this far-reaching failure. It wouldn't matter how many availability zones or regions they have deployed in with CSPs whose data centres are located in Europe and America.
A Slight Digression on Root Cause Analysis
A few organizations sent notifications to customers up to three days after this major incident. Users might have spent some time troubleshooting their phones and computers, moving from one network to another, or simply complaining to no one in particular before those notifications came. One might question the level of proactive monitoring such organizations engage in, but I would argue that in many cases the real issue lies with architecture governance and documentation.
A monitoring tool will observe an outage; architecture documented at a reasonable level of detail will show how the known incident relates to the observed or observable customer experience. More sophisticated shops may have alternative approaches to maintaining a useful view of such relationships. That level of certainty about the relationships helps service management communicate with customers clearly.
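One lightweight way to keep that relationship inspectable is a dependency map maintained alongside the architecture documentation. The sketch below (plain Python; the component and service names are invented for illustration, not taken from any real architecture) walks such a map so that an alert on a low-level component can be translated into the list of customer-facing services to name in the notification.

```python
# Hypothetical dependency map: which customer-facing services sit on top of
# which components. Names are illustrative, not from any real architecture.
from collections import deque

DEPENDS_ON = {
    "mobile-banking-app": ["api-gateway"],
    "api-gateway":        ["payments-service", "accounts-service"],
    "payments-service":   ["core-db", "isp-uplink"],
    "accounts-service":   ["core-db"],
    "core-db":            [],
    "isp-uplink":         ["undersea-cable-west-africa"],  # outside our control
    "undersea-cable-west-africa": [],
}

def impacted_services(failed_component: str) -> set[str]:
    """Return every node that directly or indirectly depends on the failed one."""
    impacted = set()
    frontier = deque([failed_component])
    while frontier:
        current = frontier.popleft()
        for node, deps in DEPENDS_ON.items():
            if current in deps and node not in impacted:
                impacted.add(node)
                frontier.append(node)
    return impacted

# A monitoring alert on the cable becomes a customer-facing impact list:
print(impacted_services("undersea-cable-west-africa"))
# -> {'isp-uplink', 'payments-service', 'api-gateway', 'mobile-banking-app'}
```

Whether the map lives in code, a CMDB or an architecture tool matters less than the fact that someone can traverse it within minutes of an incident.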
It is indeed a hard place to get to, but I think organizations should start the journey. This old, short article might serve as a thought trigger on the subject of documenting architecture.
Other Conversations that Came Up
From questions about whether a ship ran into the undersea cables in question, to conspiracy theories about neocolonialism, to brainstorming on the potential of Low Earth Orbit (LEO) satellites, so many conversations erupted within the past week. We all tend to think about the complexities of our technology-enabled lives when issues occur.
Who would have listened if someone decided they wanted to anticipate potential alternatives to known routes to the Internet? Or engage African governments as a consortium of African CEOs on behalf of Elon Musk? Or drive harder the quest for CSP data centres on African soil? Such futuristic conversations are often not considered urgent. However, they are very important.
Conclusion
At the time of this writing, my opinion on using availability zones for most availability scenarios remains the same. Quite a few telcos have advised customers that their efforts to restore service by acquiring additional capacity are on course.
It would be interesting to read your opinions. As a result of this recent event, what might be your organization's posture in terms of reviewing its approach to high availability and investments in redundancy? In what ways can the impact of component failures outside your "boundaries of control" be mitigated?
Speaking Skills Coach | Business English Teacher | Be truly confident while conducting business in English. Leverage your culture for the English you use. In business. For life.
8 months ago · Kenneth, you describe a situation that is unheard of in Switzerland and most of Europe, I dare say... and it was enlightening to see how many challenges are involved in solving this. As always, I find nuggets that apply to my non-IT life, and this time it was: "This forces the requester to think deeply about what they are asking for." Yuzzz...
Lead, Network & Security Specialist at Opentext | CCIE | CISSP | PCCSE | MBA | SDG16
8 months ago · Nice thought, Kenneth Igiri. Technologically, we should have matured beyond availability and instead focused on innovation.
Secretary General: Association of Miners and Processors of Barite (AMABOP)
8 months ago · Kenneth Igiri, I had to restart my phone several times; it was such a bizarre moment because the 5G sign was on and off. I called my account officer and learned that there was a general disruption. I was going to have a business conversation with a contact in the US and had to place a direct call; luckily I had enough credit, but everything vanished after a call of over an hour. Our government needs to take these things more seriously and proactively plan for redundancy and expansion of services.