Podcasting Internet Failures
David Owczarek
Senior leader, writer, and speaker focused on SRE/DevOps and operating compliant, secure services at scale in the cloud
About six months ago, I was thinking about putting together a regular podcast or perhaps a newsletter to review major outages. As I started researching the idea, I came across The Internet Report (Apple Podcasts, Spotify, SoundCloud, YouTube), and I’ve been listening ever since. The podcast is hosted by Barry Collins and Mike Hicks of the ThousandEyes team at Cisco, with a new episode every two weeks. ThousandEyes also publishes evidence and analysis of recent outages on its blog.
ThousandEyes is a pretty interesting concept. They use a massive amount of global telemetry to detect issues and impacts from incidents all over the world. Often, these issues are at the internet service provider (ISP) or cloud service provider (CSP) level. However, even when the failure is deep within a particular company's SaaS infrastructure, it can generate symptomatic telemetry that ThousandEyes can detect. They’ve even rolled this up into a global real-time outage map.
For the record, I have no relationship with ThousandEyes; I just think it’s a great idea in a space that I’ve spent a lot of time thinking about.
One thing I really like about the podcast is the opportunity to learn a new way of looking at things. For example, in a recent episode, Mike Hicks talked about the idea of outages spawning multiple, interacting patterns. “Patterns over patterns” is how he referred to it. (Tip: don’t Google that unless you are really into fashion.) But the idea is that a pattern at one level, say, a network routing issue, can spawn a pattern at a different level. Understanding how these patterns interact can lead to much better insight. The ultimate goal is to be able to attribute every behavior to a part of the system. More formally, I think of it as the response of that system to the underlying, antecedent causes.
One of the most interesting recent insights is that, across ISP and CSP outages combined, the CSP share of the total has been growing at an alarming rate, from 12% or so back in 2022 to almost 40% in the first half of 2024 (source: Cloud Outages Rise & Other H1 2024 Internet Outage Trends). To be fair, not all of the outages tracked produce customer impact, and even when they do, the impact is often limited by other factors.
I have a few mild criticisms. Probably the only one that is worth mentioning here is that they can get fairly speculative when talking about specific incidents. They may know that there were DNS issues, gateway problems, or other symptomatic telemetry, but once the problem context gets down into the application, their visibility is more limited. This leads to some conjecture about what the underlying root cause could be.
It would be fantastic if Barry and Mike could bring in engineering leaders from some of the companies that had outages during the prior interval and are willing to talk about them. That would make for some super compelling content. Along those lines, I wonder what kind of insights we’d get if The Internet Report and the Void Report did some collaborative research. I’d be happy to help!