Podcasting Internet Failures

David Owczarek

Senior leader, writer, and speaker focused on SRE/DevOps and operating compliant, secure services at scale in the cloud

Published: June 20, 2024

About six months ago, I was thinking about putting together a regular podcast or perhaps a newsletter to review major outages. As I started researching the idea, I came across The Internet Report (Apple Podcasts, Spotify, SoundCloud, YouTube), and I’ve been listening ever since. The podcast is hosted by Barry Collins and Mike Hicks of the ThousandEyes team at Cisco, with a new episode every two weeks. ThousandEyes also publishes evidence and analysis for recent outages on its blog.

ThousandEyes is a pretty interesting concept. They use a massive amount of global telemetry to detect issues and impacts from incidents all over the world. Often, these issues are at the internet service provider (ISP) or cloud service provider (CSP) level. However, even when the failure is deep within a particular company's SaaS infrastructure, it can generate symptomatic telemetry that ThousandEyes can detect. They’ve even rolled this up into a global real-time outage map.

For the record, I have no relationship with ThousandEyes; I just think it’s a great idea in a space that I’ve spent a lot of time thinking about.

One thing I really like about the podcast is the opportunity to learn a new way of looking at things. For example, in a recent episode, Mike Hicks talked about the idea of outages spawning multiple, interacting patterns. “Patterns over patterns” is how he referred to it. (Tip: don’t Google that unless you are really into fashion.) The idea is that a pattern at one level, say, a network routing issue, can spawn a pattern at a different level. Understanding how these patterns interact can lead to much better insight. The ultimate goal is to be able to attribute every behavior to a part of the system. More formally, I think of each behavior as the response of the system to the underlying, antecedent causes.

One of the most interesting recent insights is that, across ISP and CSP outages combined, the share attributable to CSPs has been growing at an alarming rate, from 12% or so back in 2022 to almost 40% in the first half of 2024. Source: Cloud Outages Rise & Other H1 2024 Internet Outage Trends. To be fair, not all of the outages tracked produce customer impact, and even when they do, the impact is often limited by other factors.

I have a few mild criticisms. Probably the only one that is worth mentioning here is that they can get fairly speculative when talking about specific incidents. They may know that there were DNS issues, gateway problems, or other symptomatic telemetry, but once the problem context gets down into the application, their visibility is more limited. This leads to some conjecture about what the underlying root cause could be.

It would be fantastic if Barry and Mike could bring in engineering leaders from companies that had outages during the prior interval and are willing to talk about them. That would make for some super compelling content. Along those lines, I wonder what kind of insights we’d get if the Internet Report and the Void Report did some collaborative research. I’d be happy to help!
