What can telecom provider outages teach us?
Nauman Noor
Public Cloud Engineering Leader | IT Strategy | Infrastructure | Lakehouse, Gen AI | GRC
Background
We often take our cell phones for granted because the wireless networks they connect to are built to be resilient and always available. Given their importance to essential services (e.g., public emergency services), their availability and capacity service levels are heavily regulated. When there is an outage, the overseeing regulatory body usually conducts a formal root cause analysis, and the findings are publicly shared. These reports are a great read, as they highlight best practices that one can adopt when designing and building resilient enterprise platforms and services (greater than four nines of availability, which is never trivial).
For this article, the following reports were referenced:
[Canada] Assessment of Rogers Networks for Resiliency and Reliability Following the 8 July 2022 Outage
The Rogers outage of July 2022 lasted over 16 hours, with 12 million customers losing wireless and wireline services. This included mobile subscribers, home Internet users, corporate customers, and institutional customers that provide critical services (e.g., Interac e-Transfer and electronic payment services).
[United States] February 22, 2024, AT&T Mobility Network Outage
The AT&T wireless service outage on February 22, 2024 lasted at least 12 hours and prevented customers from using voice and data services, blocking more than 92 million phone calls and more than 25,000 attempts to reach 911.
Let’s get into the learnings and takeaways.
Recap of deficiencies
Though dissimilar networks and events led to each large-scale failure, they share common themes. Key ones include:
The following are more specific to Rogers (vs. AT&T), as the Rogers failure impacted the IP backbone, whereas AT&T experienced 5G and voice backhaul downtime.
Recommendations
Overview
Similar themes can be noted in the recommendations – the table below summarizes their potential impact when extrapolated to a global networking environment supporting hybrid cloud deployments.
Verification and Validation (V&V) enablement
Distributed systems are complex, and configuration / software changes often have repercussions beyond the systems being changed.
Though it is intuitive to have a sandbox / lab environment, it is often challenging to stand up a truly representative copy of production (third-party integrations, complexity, and investment hurdles are common impediments).
In addition, traditional code reviews (aka ‘pull requests’) can fall short when confronted with a complex code base and when ‘guardrails’ are not enforced in an automated manner.
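As a quick illustration of what automated ‘guardrails’ could look like, the minimal Python sketch below could run as a pre-merge check on candidate device configurations. The file layout, configuration syntax, prefix limit, and redundancy rule are assumptions for illustration only, not a prescribed standard.

```python
"""Minimal pre-merge guardrail check for network configuration changes.

Assumptions (illustrative only): candidate device configs are staged as
plain-text files under ./candidate_configs/, one .cfg file per device,
and the policy thresholds have been agreed upon with the network team.
"""
from pathlib import Path
import re
import sys

MAX_ADVERTISED_PREFIXES = 90   # stay well under a provider's 100-route limit
MIN_BGP_NEIGHBORS = 2          # require redundant peering on every edge device


def check_device(config_text: str) -> list[str]:
    """Return a list of guardrail violations for a single device config."""
    findings = []
    # Count statically listed prefix advertisements (hypothetical 'network x.x.x.x ...' syntax).
    prefixes = re.findall(r"^\s*network\s+\d+\.\d+\.\d+\.\d+", config_text, re.MULTILINE)
    if len(prefixes) > MAX_ADVERTISED_PREFIXES:
        findings.append(f"{len(prefixes)} advertised prefixes exceeds limit of {MAX_ADVERTISED_PREFIXES}")
    # Require redundant BGP neighbors (hypothetical 'neighbor x.x.x.x ...' syntax).
    neighbors = set(re.findall(r"^\s*neighbor\s+(\d+\.\d+\.\d+\.\d+)", config_text, re.MULTILINE))
    if len(neighbors) < MIN_BGP_NEIGHBORS:
        findings.append(f"only {len(neighbors)} BGP neighbor(s); policy requires {MIN_BGP_NEIGHBORS}")
    return findings


def main() -> int:
    failed = False
    for path in sorted(Path("candidate_configs").glob("*.cfg")):
        for finding in check_device(path.read_text()):
            print(f"[GUARDRAIL] {path.name}: {finding}")
            failed = True
    return 1 if failed else 0   # non-zero exit blocks the merge in CI


if __name__ == "__main__":
    sys.exit(main())
```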
A potential approach (for networks, to align with the background above) would be:
Use of network digital twins / network simulators: These are most effective in enterprises leveraging a single network solution. Aside from implementing the tools’ agents and endpoints, there can be considerable maintenance effort for it to remain a representative twin – effort that is often missed in the planning and funding phase of a program. Coverage can also be lacking (e.g., BGP), which drives the need for other complementary approaches (a simulator-based pre-change check is sketched after this list).
Moving away from a monolithic production environment: The CrowdStrike event had pundits advocating canary deployments to enterprises (though that advice is more apropos for the tool vendors). Fundamentally, network environments tend to be a single monolithic ‘production’ tier. For example, segregating wireline from wireless is possible, though it has significant capital implications.
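To make the simulator idea more concrete, the sketch below assumes the open-source Batfish service is running locally and that candidate device configurations are staged under snapshots/candidate; it flags BGP sessions that would not establish before the change is rolled out. Treat it as a starting point under those assumptions, not a complete validation suite.

```python
"""Pre-change validation against a network simulator (Batfish).

Assumptions (illustrative): a Batfish service is reachable on localhost and
the candidate device configurations are staged under snapshots/candidate/configs.
"""
from pybatfish.client.session import Session

bf = Session(host="localhost")
bf.set_network("pre_change_validation")
bf.init_snapshot("snapshots/candidate", name="candidate", overwrite=True)

# Ask the simulator which BGP sessions would come up with the candidate configs.
sessions = bf.q.bgpSessionStatus().answer().frame()

# Column names follow the pybatfish documentation; verify against your version.
broken = sessions[sessions["Established_Status"] != "ESTABLISHED"]
if not broken.empty:
    print("Candidate change leaves BGP sessions down:")
    print(broken[["Node", "Remote_Node", "Established_Status"]])
    raise SystemExit(1)

print("All simulated BGP sessions establish; change can proceed to canary rollout.")
```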
In the case of greenfield / new environments, one consideration is to have independent network ‘stacks’ along the following lines (or permutations thereof):
· Non-production vs. production
· Geographical delineation (e.g., Americas vs. EMEA)
· Individual cloud service provider vs. colo / data center
The implication of such an approach is redundancy in tooling (a separate control plane for each environment, along with the complexity of enabling and maintaining configuration across them). In addition, further automation / tooling would be needed to detect variances / drift that may accumulate over time due to inconsistent states.
Once in place, this approach allows for incremental rollouts that mitigate the ‘blast radius’ of an unintended change.
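One lightweight way to catch the drift mentioned above is to periodically hash and compare the rendered configuration of each stack. The sketch below is a minimal illustration and assumes each environment’s configs can be exported to a local directory; the stack names are placeholders.

```python
"""Minimal configuration-drift check across independent network stacks.

Assumption (illustrative): each stack's rendered configuration is exported
to exports/<stack_name>/, with identically named files for comparable devices.
"""
import hashlib
from pathlib import Path

STACKS = ["prod-americas", "prod-emea"]   # hypothetical stack names


def fingerprint(stack: str) -> dict[str, str]:
    """Map each exported config file to a content hash."""
    root = Path("exports") / stack
    return {p.name: hashlib.sha256(p.read_bytes()).hexdigest()
            for p in sorted(root.glob("*.cfg"))}


baseline, *others = (fingerprint(s) for s in STACKS)
for stack, snapshot in zip(STACKS[1:], others):
    drifted = [name for name, digest in snapshot.items()
               if baseline.get(name) != digest]
    missing = sorted(set(baseline) - set(snapshot))
    if drifted or missing:
        print(f"[DRIFT] {stack}: differing={drifted} missing={missing}")
    else:
        print(f"[OK] {stack} matches {STACKS[0]}")
```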
Change execution
Recommendations include:
Proactive monitoring of critical constraints / parameters
Monitoring of key network states and configurations is often conflated with observability. This results in less-than-robust coverage in large enterprises.
For example, route aggregation is key when using WAN constructs such as AWS Direct Connect – if the advertised routes exceed 100, the BGP session goes down. The challenge is that parameters such as these require custom utilities to perform trend analysis and notify when there is a risk of such an event occurring.
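A hedged sketch of such a custom utility is shown below. It assumes a hypothetical collector function that returns the current advertised-prefix count per BGP session (for example, polled from the edge routers) and raises an alert with headroom before the 100-route limit is reached.

```python
"""Threshold / trend watch for advertised BGP prefixes on a WAN link.

Assumptions (illustrative): get_advertised_prefix_counts() is a hypothetical
collector returning {session_name: prefix_count}, e.g. polled from edge
routers; alerting here is a simple print, to be replaced with your paging tool.
"""
from collections import deque

ROUTE_LIMIT = 100      # provider limit beyond which the BGP session drops
WARN_THRESHOLD = 80    # alert with headroom, before the hard limit
HISTORY = {}           # session -> recent samples, for crude trend analysis


def get_advertised_prefix_counts() -> dict[str, int]:
    """Hypothetical collector; replace with SNMP/CLI/streaming-telemetry polling."""
    return {"dx-private-vif-1": 78, "dx-private-vif-2": 91}


def evaluate() -> None:
    """Intended to be called on a schedule (e.g., every few minutes)."""
    for session, count in get_advertised_prefix_counts().items():
        samples = HISTORY.setdefault(session, deque(maxlen=12))
        samples.append(count)
        trend = samples[-1] - samples[0]   # growth across the retained window
        if count >= WARN_THRESHOLD:
            print(f"[ALERT] {session}: {count}/{ROUTE_LIMIT} advertised prefixes "
                  f"(trend {trend:+d} over last {len(samples)} samples)")


if __name__ == "__main__":
    evaluate()
```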
In a large organization, the observability teams lack knowledge of which configurations are critical, the networking teams lack development expertise, and the platform teams are not engaged in the resiliency discussion.
Organizational boundaries aside, pragmatically, assigning this responsibility to the platform team would ensure these items are instrumented in a consistent manner across the various components of the OSI stack.
Segregation of control plane traffic from data plane
The control plane typically includes all the API calls, UI traffic, and authentication flows related to the cloud service providers, as well as the management tooling needed for sound operations (e.g., in-house automation, deployment, and core services).
A key recommendation is to segregate control plane flows from data plane flows, using separate, redundant connectivity.
In addition, the segregation should include telemetry data (e.g., metrics, logs). Though it requires upfront engineering design and enablement, it mitigates scenarios where data plane downtime impacts operational visibility and the ability to update configurations.
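As a small illustration of keeping operational visibility independent of the data plane, the sketch below probes management and telemetry endpoints while forcing traffic out of a dedicated out-of-band source address. The endpoint list and source IP are placeholders for whatever the out-of-band design actually provides.

```python
"""Probe management/telemetry endpoints via a dedicated out-of-band source address.

Assumptions (illustrative): the host has an out-of-band interface bound to
OOB_SOURCE_IP, and ENDPOINTS lists the management/telemetry targets to verify.
"""
import socket

OOB_SOURCE_IP = "10.255.0.10"   # placeholder out-of-band interface address
ENDPOINTS = [("mgmt.example.internal", 443),
             ("telemetry.example.internal", 4317)]   # placeholder targets


def reachable_via_oob(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers when sourced from the OOB address."""
    try:
        with socket.create_connection((host, port), timeout=timeout,
                                      source_address=(OOB_SOURCE_IP, 0)):
            return True
    except OSError:
        return False


for host, port in ENDPOINTS:
    status = "reachable" if reachable_via_oob(host, port) else "UNREACHABLE"
    print(f"{host}:{port} via out-of-band path: {status}")
```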
Redundant communication channels
Access to communication tools (e.g., Microsoft Teams, Slack) may be impacted by an outage affecting the data plane. This would impair responders’ ability to interact and engage others during the recovery process.
It is prudent to have redundancy in tooling, with at least some tools able to leverage the infrastructure supporting control plane data flows. To mitigate third-party failure, one can leverage self-hosted options, for which several open-source and commercial options are available.
Finally, to reduce reliance on cloud storage for business continuity plans, it may be helpful to automatically ‘publish’ documents in open formats (e.g., PDF/A, Markdown) hosted on distributed, secure web farms accessible to the target audience.
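One simple way to automate that publishing step, sketched below, is to convert source runbooks to an open format with the pandoc CLI and mirror the output to multiple hosts over rsync. Pandoc and rsync are assumed to be installed, and the paths and host names are placeholders.

```python
"""Publish business continuity runbooks to open formats and mirror them.

Assumptions (illustrative): pandoc and rsync are installed, source documents
live under runbooks_src/, and MIRROR_HOSTS are placeholder destinations.
"""
import subprocess
from pathlib import Path

SRC_DIR = Path("runbooks_src")
OUT_DIR = Path("runbooks_published")
MIRROR_HOSTS = ["bcp-web-1.example.internal", "bcp-web-2.example.internal"]

OUT_DIR.mkdir(exist_ok=True)

# Convert each source document (e.g., .docx) to Markdown using pandoc.
for doc in SRC_DIR.glob("*.docx"):
    target = OUT_DIR / (doc.stem + ".md")
    subprocess.run(["pandoc", str(doc), "-o", str(target)], check=True)

# Mirror the published set to each distributed web host over SSH.
for host in MIRROR_HOSTS:
    subprocess.run(["rsync", "-az", "--delete", f"{OUT_DIR}/",
                    f"{host}:/var/www/bcp/"], check=True)
```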
#IToutage #ITresiliency #continuousavailability