Single Points of Failure
Kenneth Igiri
Enterprise Architect | Enabling Long-Term Business-Tech Alignment with Architecture & Strategy Tools
This past week was one of those weeks when a certain theme kept staring me in the face as a rising professional. That theme is redundancy! The objective of redundancy in system design is typically to achieve high availability and/or disaster recovery.
The first challenge this week that brought these terms to the fore was the choice of availability zones versus regions as mechanisms for addressing availability concerns. The second was what I might term "boundaries of control" in the quest to keep systems "up", specifically the widespread unavailability of Internet access in some African countries.
Do We Really Need Multiple Regions?
In a conversation with respected service providers, I argued that deploying two regions for cloud infrastructure on Amazon Web Services (AWS) or another cloud service provider should start with a conversation about what availability requirements we are actually trying to achieve. In essence, my position is this: when an entire CSP region fails, it makes the news precisely because it doesn't happen often enough to be a grave concern for most organizations. Incurring the cost and embracing the complexity of setting up alternate regions in such a way that it actually works should be reserved for the most stringent availability requirements.
Availability Requirements
This argument is another stab at establishing first what it is we are trying to achieve. 99.9%? 99.99%? 99.999%? Brent Ozar, a respected senior professional in the data management space, once said in a training that when a client wants high availability, the best approach is to offer options along with the costs associated with those options. This forces the requester to think deeply about what they are asking for. I thought then, and still think now, that it makes sense. Start with what is required.
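To make those targets easier to debate, a rough sketch helps. The snippet below (plain Python; the targets are common examples, not figures from any specific client) converts an availability percentage into the downtime it actually allows per year, which is usually the number executives need to see next to the price tag.

```python
# Convert an availability target into the downtime it allows per year.
# Illustrative only; the targets below are common examples, not requirements
# from any particular client or provider.

MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_budget(availability_pct: float) -> float:
    """Return the allowed downtime per year, in minutes."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for target in (99.9, 99.99, 99.999):
    minutes = downtime_budget(target)
    print(f"{target}% availability -> about {minutes:,.0f} minutes "
          f"({minutes / 60:.1f} hours) of downtime per year")

# 99.9%   -> ~526 minutes (~8.8 hours) per year
# 99.99%  -> ~53 minutes per year
# 99.999% -> ~5 minutes per year
```

Seeing 8.8 hours versus 5 minutes of allowed downtime per year makes the cost conversation far more concrete than comparing strings of nines.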
Many IT organizations spend enormous amounts of money, time and effort architecting and building resilience that is hardly ever used, infrequently tested, and often unsuccessful when needed. This may be traceable to the fact that points of failure are not properly analysed and that some causes of failure are simply outside our "boundaries of control". Recall the cloud outages of 2020 and 2021 attributed to DNS and Active Directory (AD).
AZs Might Just Be Sufficient
My opinion about most cloud deployments on the well-known cloud service providers is that using availability zones for redundancy is likely to address the majority of failure scenarios. Why maintain passive infrastructure in alternate regions for solutions that do not require five nines? I would understand such effort and cost for directory services, payment systems, telemedicine and other mission-critical solutions, but others should be evaluated more deeply.
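For many workloads, spreading a resource across availability zones is a configuration flag rather than a whole second deployment. As a hypothetical sketch (boto3 against AWS RDS; the identifiers, engine and sizes are placeholders I made up, not a recommendation), a Multi-AZ database gives you a synchronous standby in another zone without any cross-region architecture:

```python
# Hypothetical sketch: provisioning a Multi-AZ RDS instance with boto3.
# Identifiers, engine and sizes are placeholders, not a recommendation.
import boto3

rds = boto3.client("rds", region_name="eu-west-1")

rds.create_db_instance(
    DBInstanceIdentifier="orders-db",       # placeholder name
    DBInstanceClass="db.t3.medium",
    Engine="postgres",
    AllocatedStorage=100,                   # GiB
    MasterUsername="dbadmin",
    MasterUserPassword="change-me",         # use a secrets manager in practice
    MultiAZ=True,                           # standby replica in another AZ
)
# With MultiAZ=True, the provider keeps a standby in a second availability
# zone and fails over automatically; no second region is involved.
```

The point of the sketch is the contrast in effort: one flag for zone-level redundancy versus an entire replicated estate, runbooks and failover testing for a second region.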
Most Importantly, Does It Work?
We may throw three copies of each component at a solution and still not meet the need from a service perspective. Many IT leaders end up explaining to executives why it didn't work. The bottom line is that it may not always be possible to cater for every failure scenario, and we shouldn't promise anyone that we can. Even if we can, should we?
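A back-of-the-envelope calculation shows why. In the sketch below (plain Python, with made-up component figures), redundant copies multiply out beautifully in parallel, but a single serial dependency we do not control, such as DNS or an undersea cable, caps the availability of the whole chain.

```python
# Back-of-the-envelope availability math with illustrative (made-up) figures.

def parallel(a: float, copies: int) -> float:
    """Availability of N independent, redundant copies of a component."""
    return 1 - (1 - a) ** copies

def serial(*availabilities: float) -> float:
    """Availability of a chain where every component must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

app = parallel(0.99, copies=3)    # three copies of a 99% component -> 99.9999%
db = parallel(0.995, copies=2)    # two copies of a 99.5% component
single_cable = 0.98               # one dependency outside our control, no redundancy

end_to_end = serial(app, db, single_cable)
print(f"app tier:   {app:.6f}")
print(f"db tier:    {db:.6f}")
print(f"end to end: {end_to_end:.6f}")  # dominated by the 98% single point of failure
```

Tripling every component we own still leaves the service at roughly 98% end to end, because the weakest serial link sets the ceiling.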
So, What Really Happened with Africa's Internet?
A couple of days after I took my confident position on the reliability of the cloud, almost like a response from "principalities and powers", most of us lost Internet access. It is now common knowledge that the failure of Internet services in a few African countries was caused by the failure of four undersea fibre-optic cables. Jess Auerbach Jahajeeah, in her interview with The Conversation U.S., pointed out that while these systems are expected to be built for redundancy, some countries have only one cable terminating on their borders.
This means that a startup in Lagos, Accra or Freetown leveraging the cloud will still suffer an outage, from the customer's perspective, as a result of this far-reaching failure. It wouldn't matter how many availability zones or regions they have deployed in with CSPs whose data centres are located in Europe and America.
A Slight Digression on Root Cause Analysis
A few organizations sent notifications to customers up to three days after this major incident. Users might have spent some time troubleshooting their phones and computers, moving from one network to another, or simply complaining to no one in particular before those notifications came. One might question the level of proactive monitoring such organizations engage in, but I would argue that in many cases the real issue lies with architecture governance and documentation.
A monitoring tool will observe an outage; architecture documented at a reasonable level of detail will show how the known incident relates to the observed or observable customer experience. More sophisticated shops may have alternative approaches to maintaining a useful view of such relationships. That level of certainty about the relationships helps service management communicate with customers clearly.
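One lightweight way to keep that relationship inspectable is a dependency map maintained alongside the architecture documentation. The sketch below (plain Python; the component and service names are invented for illustration, not taken from any real architecture) walks such a map so that an alert on a low-level component can be translated into the list of customer-facing services to name in the notification.

```python
# Hypothetical dependency map: which customer-facing services sit on top of
# which components. Names are illustrative, not from any real architecture.
from collections import deque

DEPENDS_ON = {
    "mobile-banking-app": ["api-gateway"],
    "api-gateway":        ["payments-service", "accounts-service"],
    "payments-service":   ["core-db", "isp-uplink"],
    "accounts-service":   ["core-db"],
    "core-db":            [],
    "isp-uplink":         ["undersea-cable-west-africa"],  # outside our control
    "undersea-cable-west-africa": [],
}

def impacted_services(failed_component: str) -> set[str]:
    """Return every node that directly or indirectly depends on the failed one."""
    impacted = set()
    frontier = deque([failed_component])
    while frontier:
        current = frontier.popleft()
        for node, deps in DEPENDS_ON.items():
            if current in deps and node not in impacted:
                impacted.add(node)
                frontier.append(node)
    return impacted

# A monitoring alert on the cable becomes a customer-facing impact list:
print(impacted_services("undersea-cable-west-africa"))
# -> {'isp-uplink', 'payments-service', 'api-gateway', 'mobile-banking-app'}
```

Whether the map lives in code, a CMDB or an architecture tool matters less than the fact that someone can traverse it within minutes of an incident.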
It is indeed a hard place to get to, but I think organizations should start the journey. This old, short article might serve as a thought trigger on the subject of documenting architecture.
Other Conversations that Came Up
From questions about whether a ship ran into the undersea cables in question, to conspiracy theories about neocolonialism, to brainstorming on the potential of Low Earth Orbit (LEO) satellites, so many conversations erupted within the past week. We all tend to think about the complexities of our technology-enabled lives when issues occur.
Who would have listened if someone decided they wanted to anticipate potential alternatives to known routes to the Internet? Or engage African governments as a consortium of African CEOs on behalf of Elon Musk? Or drive harder the quest for CSP data centres on African soil? Such futuristic conversations are often not considered urgent. However, they are very important.
Conclusion
At the time of this writing, my opinion on using availability zones for most availability scenarios remains the same. Quite a few telcos have advised customers that their efforts to restore service by acquiring additional capacity are on course.
It would be interesting to read your opinions. As a result of this recent event, what might be your organization's posture in terms of reviewing its approach to high availability and investments in redundancy? In what ways can the impact of component failures outside your "boundaries of control" be mitigated?
Speaking Skills Coach | Business English Teacher | Be truly confident while conducting business in English. Leverage your culture for the English you use. In business. For life.
8 months ago · Kenneth, you describe a situation that is unheard of in Switzerland and most of Europe, I dare say... and it was enlightening to see how many challenges are involved in solving this. As always, I find nuggets that apply to my non-IT life, and this time it was: "This forces the requester to think deeply about what they are asking for." Yuzzz...
Lead, Network & Security Specialist at Opentext | CCIE | CISSP | PCCSE | MBA | SDG16
8 months ago · Nice thought, Kenneth Igiri. Technologically, we should have matured beyond availability and instead focused on innovation.
Secretary General: Association of Miners and Processors of Barite (AMABOP)
8 months ago · Kenneth Igiri, I had to restart my phone several times; it was such a bizarre moment because the 5G sign was on and off. I called my account officer and learned that there was a general disruption. I was going to have a business conversation with a contact in the US and had to place a direct call; luckily I had enough credit, but everything vanished after a call of over an hour. Our government needs to take these things more seriously and proactively plan for redundancy and expansion of services.