Containing the Networking 'Blast Radius'

As the world relies more and more on cloud computing for its day-to-day functioning and real-time communications, the availability and consistency of ‘distributed cloud computing’ play a vital role. Lately, there has been a spate of large-scale catastrophic outages in North America, Asia, Europe, and very recently Oceania, as well as at Hyperscalers, with far-reaching impacts on society and commerce. Everything that could go wrong did go wrong. Complex interactions among multiple faults (e.g. a TE controller removing routes or going offline, incorrect native IGP configurations on switches, delayed controller responses) caused the faults to snowball and cascade uncontrollably. Even after the original root cause was fixed, the aftereffects continued. The majority of these outages can be traced back to networking, and they share many parallels: removed or misconfigured route filters that allowed route leaks to overwhelm internal routers, wrong circuit breaker settings (Maximum-Prefix), and loss of the Control Plane triggering spikes in IGP/BGP route advertisements. Because it is nearly impossible to eliminate every fault scenario, there is now a lot of industry focus on containing the Blast Radius (the extent of the impact from a fault).
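The Maximum-Prefix ‘circuit breaker’ mentioned above can be illustrated with a minimal Python sketch. The class name, the limit, and the warning threshold are hypothetical and do not correspond to any vendor's implementation; the sketch only shows the idea that a session is shut down once a peer advertises more prefixes than expected, so that a route leak cannot overwhelm the router.

```python
from dataclasses import dataclass, field


@dataclass
class PrefixLimitedSession:
    """Toy model of a routing session guarded by a maximum-prefix limit.

    All names and numbers are illustrative assumptions.
    """
    max_prefixes: int            # hard limit: exceeding it shuts the session down
    warn_ratio: float = 0.8      # fraction of the limit at which a warning is logged
    prefixes: set = field(default_factory=set)
    established: bool = True
    warnings: list = field(default_factory=list)

    def receive(self, prefix: str) -> bool:
        """Accept one advertised prefix; return False if the session is down."""
        if not self.established:
            return False
        self.prefixes.add(prefix)
        count = len(self.prefixes)
        if count > self.max_prefixes:
            # Circuit breaker trips: drop everything learned from this peer.
            self.established = False
            self.prefixes.clear()
            return False
        if count >= self.warn_ratio * self.max_prefixes:
            self.warnings.append(f"{count}/{self.max_prefixes} prefixes received")
        return True


def simulate_leak(limit: int, leaked: int) -> PrefixLimitedSession:
    """Feed `leaked` distinct prefixes into a session with the given limit."""
    session = PrefixLimitedSession(max_prefixes=limit)
    for i in range(leaked):
        session.receive(f"10.{i // 256}.{i % 256}.0/24")
    return session


if __name__ == "__main__":
    normal = simulate_leak(limit=100, leaked=50)
    leak = simulate_leak(limit=100, leaked=500)
    print("normal peer still up:", normal.established)
    print("leaking peer still up:", leak.established)
```

A limit set too high never trips, and a limit set too low tears down healthy sessions, which is why a wrong setting appears among the outage causes listed above.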

As can be seen in Fig. 1 below, the core tenets of Blast Radius minimization in Distributed Cloud Computing are achieved by way of containment. The regions are geographically isolated and maintain a ‘share nothing’ philosophy. Each region is then subdivided into independent availability zones (AZ), which are built close enough to each other to support low latency and synchronous replication. Furthermore, cell-based architectures and shuffle sharding techniques are used to further contain fault propagation and reduce the Blast Radius. Given the homogeneity in the DCs, implementing these ‘Blast Radius’ containment techniques has been relatively easy and effective. However, the same level of design rigor and operational discipline has not yet been implemented in WAN underlay networks, which are often a collection of heterogeneous networks from multiple service providers and Hyperscalers.

Fig. 1
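To make the shuffle sharding idea concrete, here is a minimal Python sketch. The function names, node names, and sizes are assumptions chosen for illustration, not a description of any provider's implementation. Each customer is assigned a small random subset (shard) of the nodes, so a fault that takes out one customer's shard fully impacts only customers who happen to hold that identical shard.

```python
import random
from math import comb


def assign_shards(customers, nodes, shard_size, seed=0):
    """Give every customer a pseudo-random shard of `shard_size` nodes."""
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    node_list = list(nodes)
    return {c: frozenset(rng.sample(node_list, shard_size)) for c in customers}


def fully_impacted(assignment, failed_nodes):
    """Customers whose entire shard lies inside the set of failed nodes."""
    failed = set(failed_nodes)
    return [c for c, shard in assignment.items() if shard <= failed]


if __name__ == "__main__":
    nodes = [f"node-{i}" for i in range(8)]
    customers = [f"cust-{i}" for i in range(100)]
    assignment = assign_shards(customers, nodes, shard_size=2)

    # Suppose a faulty workload from cust-0 takes down every node in its shard.
    victims = fully_impacted(assignment, assignment["cust-0"])
    print("distinct shards possible:", comb(len(nodes), 2))
    print("customers fully impacted:", len(victims), "of", len(customers))
```

With 8 nodes and a shard size of 2 there are 28 distinct shards, so on average only a small fraction of customers share the failed shard exactly. Customers who overlap on a single node lose part of their capacity rather than all of it.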

Below are some of the WAN initiatives that are being adopted to help contain the Blast Radius.

WAN ‘Blast Radius’ Containment Initiatives

  • Traffic Engineering (TE) control plane isolation: Cloud computing networks moved to a centralized global control plane for better network efficiency and throughput. To contain the Blast Radius, the TE domains are now being sliced into sub-domains, each with its own independent sub-controller (instead of relying on controller replication). Microsoft was one of the early adopters of this strategy with its BlastShield implementation in the WAN, and now claims a 60% reduction in loss of traffic from controller failures.
  • Network Element (NE) isolation: Core WAN routers are beginning to operate at fabric capacities of 50-200T, which demands the highest level of availability. WAN router misconfigurations and loss of the control plane have been among the main causes of recent wide-scale outages. Adding further optical layer functional loads to the same routers can increase the Blast Radius exponentially. Although there could be some cost savings from IP-Optical convergence, operators need to strike the right balance between Blast Radius and cost savings.
  • Operations Rigor: There is an increased need for rigorous operations discipline that includes code reviews, testing, staggered deployments, root cause analysis, and also the use of digital twins to stress test deployment procedures and fault scenarios. Furthermore, operational automation will also be key to minimizing human errors.

The world today runs on distributed cloud computing, and the need for high availability in the face of network partitions is paramount (in addition to Scaling, Security, and Sovereignty).

As we’ve witnessed recently, an outage in ‘cloud computing’ does not just cut you off from your favorite social media sites; it can disrupt basic phone lines, paralyze transport and location services, and disrupt communications to the emergency services. As outlined above, the WAN underlays that interconnect distributed cloud computing need further improvements in design and architecture to contain the Blast Radius of ‘distributed cloud computing.’

Ryan Perera

Opinions expressed in this article are the author's own


Ishwar Chandra

Pre-sales/Sales/Business Development & Technology Management Professional - Telcos, Enterprise and Government Accounts, Ex-Huawei I Ex-RCOM (RTIC) I Ex-ITI Ltd.

1 year ago

Nicely summarized... Security policies like zero trust along with real-time monitoring can improve the cloud security; which can limit the "Blast Radius" of a data breach....

Jessica M.

Technology enthusiast & passionate about cyber security

1 year ago

A very important consideration and from my perspective, crucial to highlight. Social media and the like are the consumer side of the cloud and these are minor inconveniences. However, the disruptions and the impact aside from consumer inconveniences, logistics, revenue etc may also have deep cybersecurity impacts, with many less secure organisations relying on cloud only products that lack failover. Strong points on containment and managing the risk, it highlights the need for strong network providers who innovate in this field to support their customers. Thanks for sharing.

Jatinder Pal Singh Sehdev

Transforming Telecom Networks, Businesses and Organizations

1 year ago

Well said Ryan. Cost savings need to be balanced with network safety! Given the far reaching impact of failures, containing blast radius is a necessary part of design considerations

Manish Goel

Business Development // Network Architect // Sales and Marketing // Technologist // Negotiations // Technical Sales

1 year ago

Good Read "Blast Radius"

Michael L.

IT executive specializing in Hyperscaler Space| Power | Connectivity solutions.

1 year ago

Great article Ryan!
