Containing the Networking 'Blast Radius'
As the world relies more and more on cloud computing for its day-to-day functioning and real-time communications, the availability and consistency of ‘distributed cloud computing’ play a vital role. Lately, there has been a spate of large-scale, catastrophic outages affecting hyperscalers and networks in North America, Asia, Europe, and most recently Oceania, with far-reaching impacts on society and commerce. In these incidents, everything that could go wrong did go wrong. Complex interactions among multiple faults (e.g. a TE controller removing routes or going offline, incorrect native IGP configurations on switches, delayed controller responses) caused the faults to snowball and cascade uncontrollably. Even after the original root cause was fixed, the aftereffects continued. The majority of these outages can be traced back to networking, and they share many parallels: removed or misconfigured route filters that allowed route leaks to overwhelm internal routers, wrong circuit-breaker settings (Maximum-Prefix), and loss of the control plane triggering spikes in IGP/BGP route advertisements. Because it is nearly impossible to eliminate all fault scenarios, the industry is now focusing heavily on containing the blast radius (the extent of the impact from a fault).
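To make the Maximum-Prefix ‘circuit breaker’ idea concrete, here is a minimal Python sketch. It is a toy model, not any router's actual implementation: the class name `PeerSession`, the `warn_ratio` parameter, and the returned status strings are all assumptions made for illustration. It shows the basic behavior: once a peer advertises more prefixes than the configured limit, the session is torn down so that a route leak cannot overwhelm the router.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class PeerSession:
    """Toy model of a BGP peer guarded by a maximum-prefix limit (hypothetical)."""

    max_prefixes: int            # hard limit: exceeding it tears the session down
    warn_ratio: float = 0.8      # assumed warning threshold, as a fraction of the limit
    established: bool = True
    prefixes: Set[str] = field(default_factory=set)

    def receive(self, prefix: str) -> str:
        """Process one advertised prefix and return the resulting status."""
        if not self.established:
            return "dropped"     # session is already down; nothing is accepted
        self.prefixes.add(prefix)
        count = len(self.prefixes)
        if count > self.max_prefixes:
            # Circuit breaker trips: drop the session and flush learned routes.
            self.established = False
            self.prefixes.clear()
            return "torn_down"
        if count >= self.warn_ratio * self.max_prefixes:
            return "warning"     # approaching the limit; an operator should look
        return "accepted"


if __name__ == "__main__":
    session = PeerSession(max_prefixes=5)
    for i in range(7):
        print(i, session.receive(f"10.0.{i}.0/24"))
```

A limit set too high never trips, and a limit set too low tears down healthy sessions, which is why a wrong setting can itself contribute to an outage.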
As can be seen in Fig. 1 below, the core tenets of blast radius minimization in distributed cloud computing are achieved by way of containment. Regions are geographically isolated and maintain a ‘share nothing’ philosophy. Each region is then subdivided into independent availability zones (AZs), which are built close enough to each other to keep latency low and to support synchronous replication. Furthermore, cell-based architectures and shuffle sharding techniques are used to further contain fault propagation and reduce the blast radius. Given the homogeneity of the DCs, implementing these ‘blast radius’ containment techniques has been relatively easy and effective. However, the same level of design rigor and operational discipline has not yet been applied to WAN underlay networks, which are often a collection of heterogeneous networks from multiple service providers and hyperscalers.
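As a rough illustration of shuffle sharding, the Python sketch below assigns each customer a small, pseudo-random subset of workers, so that a failure confined to one customer's shard fully affects only the customers whose shards are entirely inside the failed set. The function names (`shuffle_shard`, `possible_shards`, `fully_impacted`), the hash-based seeding, and the worker and tenant names are assumptions chosen for this example, not a description of any provider's implementation.

```python
import hashlib
import math
import random
from typing import Dict, FrozenSet, List


def shuffle_shard(customer_id: str, workers: List[str], shard_size: int) -> FrozenSet[str]:
    """Deterministically pick `shard_size` workers for a customer."""
    # Seed a private RNG from a hash of the customer ID so the same
    # customer always lands on the same shard.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return frozenset(rng.sample(sorted(workers), shard_size))


def possible_shards(worker_count: int, shard_size: int) -> int:
    """Number of distinct shards: 'worker_count choose shard_size'."""
    return math.comb(worker_count, shard_size)


def fully_impacted(shards: Dict[str, FrozenSet[str]], failed: FrozenSet[str]) -> List[str]:
    """Customers whose every worker is in the failed set."""
    return [customer for customer, shard in shards.items() if shard <= failed]


if __name__ == "__main__":
    workers = [f"worker-{i}" for i in range(8)]
    customers = [f"tenant-{i}" for i in range(20)]
    shards = {c: shuffle_shard(c, workers, 2) for c in customers}

    # Suppose a 'poison' request from tenant-0 takes down both of its workers.
    failed = shards["tenant-0"]
    print("distinct shards available:", possible_shards(len(workers), 2))
    print("fully impacted customers:", fully_impacted(shards, failed))
```

With 8 workers and a shard size of 2 there are 28 distinct shards, so two customers share an identical shard far less often than they would under plain sharding into 4 fixed pairs.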
Below are some of the WAN initiatives that are being adopted to help contain the blast radius.
WAN ‘Blast Radius’ Containment Initiatives
The world today runs on distributed cloud computing, and the need for high availability under network partitions is paramount (in addition to scaling, security, and sovereignty).
As we’ve witnessed recently, an outage in ‘cloud computing’ does not just cut you off from your favorite social media sites; it can disrupt basic phone lines, paralyze transport and location services, and interrupt communications with emergency services. As outlined above, the WAN underlays that interconnect distributed cloud computing need further design and architectural improvements to contain the blast radius of ‘distributed cloud computing.’
Ryan Perera
Opinions expressed in this article are the author's own