Enhancing the Availability and Reliability of ISP Network Infrastructures

Introduction

This article offers some thoughts on correctly sizing one of the most fundamental "functional towers" of an ISP's networking infrastructure. Among an ISP's key performance indicators (KPIs), two are viewed as primary: "Performance" and "Availability." The reasons are simple, as we can see by considering each case:

  • What good is a high-speed Internet connection if the service is frequently unavailable to the customer?
  • What good is an Internet service with an outstanding uptime or availability rate if its performance is poor and the experience it delivers to subscribers falls short of the desired SLA (low throughput, high end-to-end latency, jitter, or packet loss)?

As we can see, these two indicators are the most basic ones in networking infrastructures, as subscribers/customers can easily perceive them when using data, video, or voice services.

This article focuses on the relationship between the Availability and Reliability indicators ("functional towers" of a technical design). I will comment on Performance and other areas in future articles.

Definition of the Availability Concept and other Peripheral Functional Towers

I treat Availability as a "functional tower" because it is possible to identify, classify, and merge sets of technical specifications and processes under this concept, leading to much better network uptime overall. This strategy includes designing proper physical, electromechanical, and logical specifications (e.g., hardware in redundant configurations, power and cooling requirements, reliability block diagrams, clusters of devices, network links). These are then combined with software-level approaches, such as services, resources, and protocols, to raise the Availability indicator to the desired state. Improving this indicator has a direct effect on customer satisfaction, competitiveness, and infrastructure costs!

Availability aims to provide the obvious: ideally, whenever a user (customer or subscriber, whatever you prefer to call them) wants to use the contracted product or service, it is available and ready for whatever that user needs. On the other hand, when users try to access something online and find it unavailable, the service comes to be defined by how often it is down, and we all know what happens next. Networks are not immune to failures, so we need to predict and anticipate these incidents so that service restoration times meet users' expectations and tolerances.

The Availability indicator is shaped by two other functional towers that support each other in the same mission: satisfying users with their contracted services. These disciplines are Reliability and Resilience.

When studying computer network reliability, we can identify factors such as the manufacturing quality of networking gear and the presence or absence of specialized technologies, both physical and logical, along with other mechanisms, peripheral resources, and processes that contribute to the intended set: redundancy + reliability + resilience = availability. In my view, network reliability is a functional tower in its own right. Still, it contributes positively to the overall (and desired) state of network availability.

Resilience, in turn, relates to how a device and the network as a whole react when infrastructure failures (links, devices) occur, whether these are failures of equipment components or incidents of a logical nature.

I like to treat these three as follows: the guiding indicator is Availability, which can be calculated and improved through sets of technical specifications derived from the principles of Reliability and Resilience.

The Challenges Providers Face Regarding Network Availability

Internet Service Providers (ISPs) need to understand the fundamentals of redundancy + reliability + resilience = availability with absolute clarity so that their infrastructures can be adapted to meet or exceed their customers' expectations and desired service level agreements. Among the many challenges, we can list some truths on the subject:

  1. The availability of a single device does not translate directly into the overall availability of a network. Those two are different things!
  2. The availability of a device and its redundancy are often in conflict, as some "simpler" devices may be more reliable. In contrast, some more reliable devices tend not to be simple to deploy and integrate!
  3. High costs are a perennial concern and need to be well balanced.
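To make the first point concrete, here is a minimal sketch, assuming the textbook model in which failures are independent: elements in series must all be up, while a redundant (parallel) group is up if at least one element is up. The function names and the 99.9% device figure are my own illustrative assumptions, not measurements from any real network.

```python
from math import prod


def series_availability(availabilities):
    """Availability of elements in series: every element must be up.

    Assumes independent failures (a simplification).
    """
    return prod(availabilities)


def parallel_availability(availabilities):
    """Availability of redundant elements: at least one must be up.

    Assumes independent failures (a simplification).
    """
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical figure: each device is available 99.9% of the time.
device = 0.999

# A path crossing five such devices in series is LESS available
# than any single device on it.
path = series_availability([device] * 5)

# A redundant pair of the same devices is MORE available than one alone.
redundant_pair = parallel_availability([device, device])

print(f"single device:  {device:.6f}")
print(f"5-hop path:     {path:.6f}")
print(f"redundant pair: {redundant_pair:.6f}")
```

The takeaway is only directional: under these assumptions, chaining devices lowers end-to-end availability, and redundancy raises it. Real networks have correlated failures (shared power, shared fiber paths, software bugs), so treat the numbers as an upper bound on what redundancy alone delivers.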

How much availability does your network infrastructure need, and how much are you willing to pay for it?

One thing that may not seem obvious to many individuals and companies: too much redundancy can be harmful because, in addition to significantly increasing the costs of the project and the infrastructure as a whole, it makes the logical functions of the network far too complex. Think about it! Excess redundancy can even work against your intended network uptime and operational management goals.

Perhaps one of the biggest challenges here is designing a redundant, reliable, and resilient infrastructure that reaches the desired Availability indicator. The choice of quantity or quality of redundancy in a network cannot be treated like "how do you like your steak done?" (rare, medium, well-done); that is, it is not a matter of personal taste. Infrastructure projects aiming at better availability need well-founded standards for physical and logical redundancy, neither too scarce nor excessive. The costs of adopting these approaches must be understood and be compatible with the business mission and desired outcomes, as well as with the financial reality of the network operator (you or your company).

Here's my first tip:

Determine DOWNTIME COSTS first, then determine and balance the Availability costs. It will be easier for you to accept the harsh reality of the investments required when you clearly understand the business impacts of a failure, whether it's a minor, low-impact annoyance or a catastrophe on your network.

  • How much are you willing to lose financially with a failure in your network?
  • How much are you willing to lose in terms of customer base, market share, reputation, and the like due to disasters in your network?
  • How much are you willing to invest to properly mitigate the many risks that can cause trauma to your business, from small but inconvenient unavailability to major headaches with massive failures and impacts?

Work through the three questions above before you even try to design your next infrastructure project!
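As a rough way to put numbers on these questions, the sketch below converts an availability percentage into expected downtime per year and compares the downtime cost saved by a higher availability target with what that target costs. Everything here is a simplifying assumption of mine: a flat cost per hour of downtime, a 365-day year, and made-up monetary figures. Only the immediate, measurable losses are modeled; long-term effects such as reputation damage are not.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def annual_downtime_hours(availability):
    """Expected hours of downtime per year for a given availability (0..1)."""
    return (1 - availability) * HOURS_PER_YEAR


def annual_downtime_cost(availability, cost_per_hour):
    """Expected yearly downtime cost, assuming a flat hourly cost."""
    return annual_downtime_hours(availability) * cost_per_hour


def upgrade_pays_off(current, target, cost_per_hour, annual_upgrade_cost):
    """True if the downtime cost avoided exceeds the yearly upgrade cost."""
    saved = (annual_downtime_cost(current, cost_per_hour)
             - annual_downtime_cost(target, cost_per_hour))
    return saved > annual_upgrade_cost


# Hypothetical inputs, for illustration only.
COST_PER_HOUR = 10_000      # revenue loss + SLA penalties per hour down
CURRENT, TARGET = 0.999, 0.9999

for label, a in (("current", CURRENT), ("target", TARGET)):
    hours = annual_downtime_hours(a)
    cost = annual_downtime_cost(a, COST_PER_HOUR)
    print(f"{label}: {a:.2%} -> {hours:.2f} h/year, cost {cost:,.0f}")

for upgrade_cost in (50_000, 100_000):
    verdict = upgrade_pays_off(CURRENT, TARGET, COST_PER_HOUR, upgrade_cost)
    print(f"upgrade at {upgrade_cost:,}/year pays off: {verdict}")
```

With these made-up inputs, going from 99.9% to 99.99% removes roughly eight hours of downtime per year, so the upgrade is justified at one price and not at the other. That is exactly the balance between downtime costs and availability costs described above.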

Matching Downtime Costs versus Availability Costs

Above all, seek to identify and quantify the following impacts on your business.

Immediate impacts:

  • Loss of revenue
  • Unexpected and undesired corrective maintenance or repair costs
  • Contracted SLA penalties
  • Customer dissatisfaction
  • Delays in internal and external projects
  • Distraction from core business activities

Long-term impacts:

  • Damage to the company/ISP reputation
  • Customer churn (loss of subscribers)
  • An unintended advantage handed to direct competitors
  • Legal actions against your business
  • Loss of trust among ISP employees/collaborators, the market, and customers

Check out the rest of this story in the full version of this article, available on the Brasil Peering Forum (BPF) wiki, written in Brazilian Portuguese:

https://wiki.brasilpeeringforum.org/w/Aprimorando_a_Disponibilidade_da_rede_do_ISP

In the full version of the article, I present some critical fundamentals of MTBF, MTTR, and MDT, concepts of parallel and serial physical and logical redundancy, technological facilities (protocols, services), and many related ideas, and ultimately show how all of this affects or contributes to the availability metrics.

Let me know your thoughts about this subject!

Until next time!

Leonardo Furtado
