Podcasting Internet Failures
Using AI to generate images with “a thousand eyes” in them produces a lot of creepy stuff

About six months ago, I was thinking about putting together a regular podcast or perhaps a newsletter to review major outages. While researching the idea, I came across The Internet Report (Apple Podcasts, Spotify, SoundCloud, YouTube), and I’ve been listening ever since. The podcast is hosted by Barry Collins and Mike Hicks of the ThousandEyes team at Cisco, with a new episode every two weeks. ThousandEyes also publishes evidence and analysis of recent outages on its blog.

ThousandEyes is a pretty interesting concept. They use a massive amount of global telemetry to detect issues and impacts from incidents all over the world. Often, these issues are at the internet service provider (ISP) or cloud service provider (CSP) level. However, even when the failure is deep within a particular company's SaaS infrastructure, it can generate symptomatic telemetry that ThousandEyes can detect. They’ve even rolled this up into a global real-time outage map.

For the record, I have no relationship with ThousandEyes; I just think it’s a great idea in a space that I’ve spent a lot of time thinking about.

One thing I really like about the podcast is the opportunity to learn a new way of looking at things. For example, in a recent episode, Mike Hicks talked about the idea of outages spawning multiple, interacting patterns. “Patterns over patterns” is how he referred to it. (Tip: don’t Google that unless you are really into fashion.) But the idea is that a pattern at one level, say, a network routing issue, can spawn a pattern at a different level. Understanding how these patterns interact can lead to much better insight. The ultimate goal is to be able to attribute every behavior to a part of the system. More formally, I think of it as the response of that system to the underlying, antecedent causes.

One of the most interesting recent insights concerns outages at ISPs and CSPs: the CSP share of the total has been growing at an alarming rate, from 12% or so back in 2022 to almost 40% in the first half of 2024 (source: Cloud Outages Rise & Other H1 2024 Internet Outage Trends). To be fair, not all of the outages tracked produce customer impact, and even when they do, the impact is often limited by other factors.

I have a few mild criticisms. Probably the only one that is worth mentioning here is that they can get fairly speculative when talking about specific incidents. They may know that there were DNS issues, gateway problems, or other symptomatic telemetry, but once the problem context gets down into the application, their visibility is more limited. This leads to some conjecture about what the underlying root cause could be.

It would be fantastic if Barry and Mike could bring in engineering leaders from some of the companies that had outages during the prior interval and were willing to talk about them. That would make for some super compelling content. Along those lines, I wonder what kind of insights we’d get if the Internet Report and the Void Report collaborated on research. I’d be happy to help!

More articles by David Owczarek

  • Please Give Me the Power

    This is a crossover story. It's about audio engineering, but also about reliability engineering.

  • 4 Ways Performing Is Like Programming

    The Set-Up When I started getting ready to perform music as a solo artist, I learned a number of humbling things about…

  • 6 Months and Counting

    It’s been six months since the layoff that put me back in the job market. It’s been crazy, in both good and bad ways.

  • 10 ways to ruin a lightning talk

    I'm submitting a lightning talk today for an upcoming SRECon. I haven't done a lightning talk before, and the format is…

  • Five Timestamps; Four Metrics

    Introduction There are five timeline events that are so critical you should record them for every outage. This isn’t…

  • What is SRE really?

    Hint: It’s not always what Google says Last year, I presented at SRECon EMEA on the topic of the biases confronting…

  • The 2023 State of DevOps Report

    Background The 2023 State of DevOps report was released recently, and there are some interesting things to discuss…

  • The Availability Enigma

    What’s availability? One of the slipperiest terms in site reliability engineering (SRE) is availability. It is intended…

  • SLOConf 2022 - 8 inspiring talks

    SLOConf 2022 is happening right now. I have been watching the content and thinking about service level objectives…

  • Two learnings from SRECon 2022

    MTT* metrics suck and we are still learning how to SRE Any questions? You gotta love a conference that opens with a…