What can telecom provider outages teach us?
System availability

Background

We often take our cell phones for granted, as the wireless networks they connect to are built to be resilient and always available. Given their importance to essential services (e.g., public emergency services), their availability and capacity service levels are heavily regulated. When there is an outage, the overseeing regulatory body usually conducts a formal root cause analysis, and the findings are publicly shared. These reports make for a great read, as they highlight best practices that one can adopt when designing and building resilient enterprise platforms and services (greater than four 9s, which is never trivial).

For this article, the following reports were referenced:

[Canada] Assessment of Rogers Networks for Resiliency and Reliability Following the 8 July 2022 Outage

The Rogers outage of July 2022 lasted over 16 hours, with 12 million customers losing wireless and wireline services. This included mobile subscribers, home Internet users, corporate customers, and institutional customers that provide critical services (e.g., Interac e-Transfer and electronic payment services).

[United States] February 22, 2024, AT&T Mobility Network Outage

The AT&T wireless service outage of February 22, 2024 lasted at least 12 hours and prevented customers from using voice and data services, blocking more than 92 million phone calls and more than 25,000 attempts to reach 911.

Let’s get into the learnings and takeaways.

Recap of deficiencies

Though the networks and triggering events were dissimilar, the resulting large-scale failures share common themes. Key ones include:

  • Superficial review of configuration changes
  • Lack of adherence to established change management process(es)
  • Insufficient capacity, contributing to cascading failures

The following are more specific to Rogers (vs. AT&T), as the failure impacted the IP backbone (whereas AT&T experienced 5G and voice backhaul downtime):

  • Control plane inaccessible due to data plane failure
  • Transmission of telemetry (e.g., logs, event details) failing due to the inaccessible data plane
  • Communication failure between response and engineering teams due to inaccessible data plane
  • Lack of communication on restoration progress to key stakeholders (e.g., 911 centers)

Recommendations

Overview

Similar themes can be noted in the recommendations – the table below summarizes their potential impact when extrapolated to a global networking environment supporting hybrid cloud deployments.

(Table: Recap of recommendations along with some relative sizing across notable dimensions)

Verification and Validation (V&V) enablement

Distributed systems are complex, with configuration / software changes often having repercussions beyond the systems being modified.

Though it is intuitive to have a sandbox / lab environment, it is often challenging to stand up an environment that is truly representative of production (third-party integrations, complexity, and investment hurdles are common impediments).

In addition, traditional code reviews (aka 'pull requests') can be challenging when confronted with a complex code base, particularly when 'guardrails' are not enforced in an automated manner.
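
As an illustration, a lightweight guardrail can be wired into the review pipeline so that policy checks do not depend on a human reviewer spotting them. The sketch below is hypothetical: the config format, the route-count limit, and the forbidden patterns are assumptions for illustration, not rules taken from either report.

```python
# guardrail_check.py - minimal sketch of an automated pre-merge policy check
# for a proposed network configuration change. Limits and patterns below are
# illustrative assumptions, not values from the referenced outage reports.
import re
import sys

MAX_ADVERTISED_ROUTES = 100          # e.g., a Direct Connect-style BGP route limit
FORBIDDEN_PATTERNS = [
    r"^no router bgp",               # wholesale removal of the BGP process
    r"^no ip route 0\.0\.0\.0",      # deletion of the default route
]

def check_config(lines: list[str]) -> list[str]:
    """Return a list of policy violations found in the proposed change."""
    violations = []
    advertised = sum(1 for line in lines if line.strip().startswith("network "))
    if advertised > MAX_ADVERTISED_ROUTES:
        violations.append(
            f"{advertised} advertised prefixes exceeds limit of {MAX_ADVERTISED_ROUTES}"
        )
    for pattern in FORBIDDEN_PATTERNS:
        for line in lines:
            if re.match(pattern, line.strip()):
                violations.append(f"forbidden statement: {line.strip()}")
    return violations

if __name__ == "__main__":
    proposed = open(sys.argv[1]).read().splitlines()
    problems = check_config(proposed)
    for problem in problems:
        print("POLICY VIOLATION:", problem)
    sys.exit(1 if problems else 0)   # a non-zero exit blocks the merge in CI
```

A check like this complements, rather than replaces, the human review: it catches mechanical violations so reviewers can focus on intent.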

A potential approach (for networks, to align with the background above) would be:

Use of network digital twins / network simulators: These are most effective in enterprises leveraging a single network solution. Aside from implementing the tools' agents and endpoints, there can be considerable maintenance effort for the twin to remain representative, effort that is often missed in the planning and funding phase of a program. Coverage can also be lacking (e.g., BGP), which drives the need for other complementary approaches.

Moving away from a monolithic production environment: The CrowdStrike event had pundits advocating canary deployments to enterprises (though the advice is more apropos for the tool vendors). Fundamentally, network environments tend to be a single, monolithic 'production' tier. For example, wireline could be segregated from wireless, though doing so has significant capital implications.

In the case of greenfield / new environments, one consideration is to have independent network 'stacks' along the following lines (or permutations thereof):

  • Non-production vs production
  • Geographical delineation (e.g., Americas vs EMEA)
  • Individual cloud service provider vs colo / data center

The implication of such an approach is redundancy in tooling (a separate control plane for each environment, along with the complexity of enabling and maintaining configuration across them). In addition, there would be a need for additional automation / tooling to detect variances / drift that may accumulate over time due to inconsistent states.
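
For example, a minimal drift check could periodically compare configuration snapshots exported from each stack and flag divergence. The sketch below assumes a flat key/value JSON snapshot per environment and hypothetical file names; both are illustrative assumptions.

```python
# drift_check.py - minimal sketch: compare configuration snapshots from two
# independent network stacks and report keys whose values have diverged.
# The snapshot format (flat key/value JSON per environment) is an assumption.
import json

def load_snapshot(path: str) -> dict:
    """Load a flattened key/value configuration snapshot for one stack."""
    with open(path) as f:
        return json.load(f)

def diff_snapshots(baseline: dict, candidate: dict) -> dict:
    """Return keys that are missing from, or differ between, the two stacks."""
    drift = {}
    for key in set(baseline) | set(candidate):
        if baseline.get(key) != candidate.get(key):
            drift[key] = {"baseline": baseline.get(key), "candidate": candidate.get(key)}
    return drift

if __name__ == "__main__":
    americas = load_snapshot("americas_stack.json")   # hypothetical snapshot files
    emea = load_snapshot("emea_stack.json")
    for key, values in diff_snapshots(americas, emea).items():
        print(f"DRIFT {key}: {values['baseline']} != {values['candidate']}")
```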

Once in place, it does allow for an incremental roll-out that mitigates the 'blast radius' of an unintended change.

Change execution

Recommendations include:

  • More than one person performing the 'peer' review, with potential involvement of other SMEs for complex changes.
  • Leverage automation where feasible to allow expedient rollback. Some systems do not have APIs, so there may be a need to invest in building an interface over what has traditionally been an interactive session on the router (see the sketch after this list).
  • Have sufficient system capacity for what may become a full system restore as a byproduct of a rollback / fallback; serverless or on-demand capacity modes may not scale up effectively and expediently in that scenario.
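
As an illustration of the automation point above, the following is a minimal sketch of a scripted change with a prepared rollback path, using the open-source Netmiko library to drive a CLI session on a device that lacks a native API. The device details, the change and rollback sets, and the 'show ip bgp summary' verification are illustrative assumptions, not steps from either report.

```python
# rollback_push.py - minimal sketch of scripted change execution with a
# prepared rollback path, driving a CLI session via the open-source Netmiko
# library on a device without a native API. Device details and command sets
# are illustrative assumptions.
from netmiko import ConnectHandler

DEVICE = {
    "device_type": "cisco_ios",            # assumed platform
    "host": "edge-router.example.net",     # placeholder hostname
    "username": "netops",
    "password": "REDACTED",
}

CHANGE_SET = ["router bgp 65001", "neighbor 198.51.100.2 maximum-prefix 90"]
ROLLBACK_SET = ["router bgp 65001", "no neighbor 198.51.100.2 maximum-prefix 90"]

def apply_with_rollback() -> None:
    conn = ConnectHandler(**DEVICE)
    try:
        conn.send_config_set(CHANGE_SET)
        # Crude post-change verification: roll back if the BGP session is down.
        state = conn.send_command("show ip bgp summary")
        if "Idle" in state or "Active" in state:
            conn.send_config_set(ROLLBACK_SET)
            raise RuntimeError("verification failed; rollback applied")
        conn.save_config()
    finally:
        conn.disconnect()

if __name__ == "__main__":
    apply_with_rollback()
```

The key design choice is that the rollback commands are prepared and reviewed before the change is pushed, so falling back does not depend on improvisation in the middle of an incident.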

Proactive monitoring of critical constraints / parameters

Monitoring of key network states and configurations may be conflated with observability. This results in less than robust coverage in large enterprises.

For example, route aggregation is key when using WAN constructs such as AWS Direct Connect: if the advertised routes exceed 100, the BGP session goes down. The challenge is that parameters such as these require custom utilities to perform trend analysis and notify when there is a risk of such an event occurring.
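
As a sketch of such a utility, the snippet below polls the advertised-prefix count and alerts before the 100-route limit is reached. The get_advertised_route_count() collector and notify() hook are hypothetical placeholders; in practice they would wrap a device query or a cloud provider API and the enterprise's alerting tooling.

```python
# route_watch.py - minimal sketch: track the number of prefixes advertised
# over a Direct Connect-style BGP session and alert before the 100-route
# limit is reached. Both helper functions are hypothetical placeholders.
import time

ROUTE_LIMIT = 100        # the BGP session drops above this
WARN_THRESHOLD = 0.8     # alert at 80% of the limit

def get_advertised_route_count() -> int:
    """Hypothetical collector; replace with a real device or cloud API query."""
    return 85              # placeholder value for illustration

def notify(message: str) -> None:
    """Hypothetical notification hook (email, chat, paging tool, etc.)."""
    print("ALERT:", message)

def watch(poll_seconds: int = 300) -> None:
    while True:
        count = get_advertised_route_count()
        if count >= ROUTE_LIMIT * WARN_THRESHOLD:
            notify(f"{count} of {ROUTE_LIMIT} advertised routes in use "
                   f"({count / ROUTE_LIMIT:.0%}); BGP session at risk")
        time.sleep(poll_seconds)

if __name__ == "__main__":
    watch()
```

Trending the count over time (rather than alerting only at the hard limit) gives the networking team room to aggregate routes before the session is actually at risk.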

In a large organization, the observability teams lack knowledge and expertise on what is a critical configuration; the networking teams lack development expertise, and the platform teams are not engaged in the resiliency discussion.

Organizational boundaries aside, pragmatically, assigning this to the platform team would ensure that these items are instrumented in a consistent manner across the various components and OSI layers.

Segregation of control plane traffic from data plane

The control plane typically includes the API calls, UI traffic, and authentication flows related to the cloud service providers, as well as the management tools needed for sound operations (e.g., in-house automation, deployment, and core services).

A key recommendation is to segregate control plane flows from the data plane, using separate, redundant connectivity.

In addition, the segregation should include telemetry (e.g., metrics, logs) data. Though it requires upfront engineering design and enablement, it mitigates scenarios where data plane downtime impacts operational visibility and the ability to update configurations.

Redundant communication channels

Access to communication tools (e.g., Microsoft Teams, Slack) may be lost in an outage affecting the data plane. This would hamper the ability of responders to interact and engage others during the recovery process.

It is prudent to have redundancy in tooling, with some tools able to leverage the infrastructure supporting control plane data flows. To mitigate third-party failure, one can use self-hosted alternatives, for which there are several open-source and commercial options available.

Finally, to reduce reliance on cloud storage for business continuity plans, it may be helpful to automatically 'publish' documents to open formats (e.g., PDF/A, Markdown) hosted on distributed, secure web farms accessible to the target audience.
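
One way to sketch this is a scheduled job that converts source documents into an open format and copies the output to wherever the fallback web farm serves content from. The example below assumes the open-source pandoc tool is installed; the paths and the .docx source format are placeholders.

```python
# publish_bcp.py - minimal sketch: convert business continuity documents to an
# open format (Markdown) on a schedule so they remain readable without the
# primary collaboration or cloud storage platform. Assumes pandoc is installed;
# source and destination paths are placeholders.
import pathlib
import subprocess

SOURCE_DIR = pathlib.Path("/srv/bcp/source")      # e.g., .docx runbooks
PUBLISH_DIR = pathlib.Path("/srv/bcp/published")  # served by the fallback web farm

def publish() -> None:
    PUBLISH_DIR.mkdir(parents=True, exist_ok=True)
    for doc in SOURCE_DIR.glob("*.docx"):
        target = PUBLISH_DIR / (doc.stem + ".md")
        # pandoc converts the proprietary format to plain Markdown
        subprocess.run(["pandoc", str(doc), "-o", str(target)], check=True)

if __name__ == "__main__":
    publish()
```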

#IToutage #ITresiliency #continuousavailability
