Importance of Reliability in Distributed Systems: Datadog Outage Case Study

Indika W.

SRE, Observability & AIOps Leader | Consultant | International Trainer | Tech Blogger | AWS Community Builder | DevOps Institute Ambassador

Published: March 17, 2023

#Datadog, a monitoring and analytics platform for IT infrastructure, experienced an outage due to the failure of a single node in their distributed data pipeline system. Customers experienced gaps in historical data, elevated latency, and error rates. The issue was resolved by replacing the failed node and backfilling data. The incident emphasized the importance of reliable and fault-tolerant distributed systems and the need for rigorous monitoring and testing to identify potential weaknesses. Future actions include reviewing the system architecture, establishing redundancy in critical components, and regular testing of resilience.

Brief Overview of Datadog:

Datadog is a monitoring and analytics platform that provides visibility into the performance of IT infrastructure, applications, and logs.

Customer Experience during the Outage:

During the outage, customers experienced gaps in their historical data for certain products. They also faced elevated latency and error rates on the web application for metric queries and APM.

Actual Issue:

The actual issue was that the data pipeline that feeds into Datadog's data ingestion systems experienced an outage, leading to a disruption in data processing and delivery.

Root Cause:

The root cause of the issue was the failure of a single node in a distributed data pipeline system. The failure resulted in the node being unable to process data, leading to a backlog of unprocessed data that overwhelmed the rest of the pipeline, causing a cascading failure across the system.
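
One common guard against this kind of backlog-driven cascade is to bound each stage's buffer and push back on producers when it fills. The sketch below is only an illustration of that idea, not a description of Datadog's pipeline; the `BoundedStage` class, its methods, and the capacity value are all hypothetical.

```python
from collections import deque


class BoundedStage:
    """Hypothetical pipeline stage with a fixed-capacity buffer."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()
        self.rejected = 0  # items refused because the buffer was full

    def offer(self, item):
        """Accept the item if there is room; otherwise signal backpressure."""
        if len(self.buffer) >= self.capacity:
            self.rejected += 1
            return False
        self.buffer.append(item)
        return True

    def drain(self, max_items):
        """Remove and return up to max_items buffered items, oldest first."""
        count = min(max_items, len(self.buffer))
        return [self.buffer.popleft() for _ in range(count)]


# A stalled consumer: five items arrive, but only three fit in the buffer.
stage = BoundedStage(capacity=3)
accepted = [stage.offer(n) for n in range(5)]
```

Because `offer` returns `False` instead of growing the buffer, an upstream producer can retry later, spill to durable storage, or drop low-priority data, so the backlog from one failed node stays contained rather than spreading downstream.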

How was the issue fixed?

The issue was fixed by identifying the failed node and replacing it with a new one. The new node was able to process data, which helped to clear the backlog and restore the system's normal operation.
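
Identifying and replacing a failed node can be sketched with a simple heartbeat check. This is a minimal illustration under assumed names: `find_failed_nodes`, `replace_node`, the node identifiers, and the 30-second timeout are hypothetical and are not taken from Datadog's systems.

```python
def find_failed_nodes(last_heartbeat, now, timeout_s):
    """Return the nodes whose last heartbeat is older than timeout_s seconds."""
    return sorted(
        node for node, seen in last_heartbeat.items() if now - seen > timeout_s
    )


def replace_node(pool, failed, replacement):
    """Return a new pool with the failed node swapped for its replacement."""
    return [replacement if node == failed else node for node in pool]


# Timestamps are plain seconds so the example stays deterministic.
heartbeats = {"node-a": 100.0, "node-b": 40.0, "node-c": 98.0}
failed = find_failed_nodes(heartbeats, now=105.0, timeout_s=30.0)
pool = replace_node(["node-a", "node-b", "node-c"], "node-b", "node-d")
```

In a real deployment the heartbeat data would come from the cluster's own health checks, and the replacement step would typically be automated by an orchestrator rather than run by hand.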

High-Level Timeline of Events:

On March 8, 2023, the data pipeline that feeds into Datadog's data ingestion systems experienced an outage.
Customers experienced gaps in their historical data for certain products and elevated latency and error rates on the web application for metric queries and APM.
The root cause of the issue was identified to be the failure of a single node in a distributed data pipeline system.
The failed node was replaced with a new one, and the new node was able to process data, clearing the backlog and restoring the system's normal operation.
Backfilling of data was completed for different products such as Real User Monitoring, Database Monitoring, Network Performance Monitoring, Network Device Monitoring, and Log Management.
The recovery was monitored until all Datadog systems were receiving data, serving queries, and evaluating monitors on live data as normal.
Customers still experienced gaps in historical data for parts of the last 24 hours.
Backfilling continued for certain products, with updates provided every 2 to 3 hours until the backfill effort was completed and the incident was fully resolved.

Lessons Learned:

The incident highlighted the importance of reliability and fault tolerance in distributed systems. It also demonstrated the need to monitor and alert on the data pipeline's performance, and to be able to identify and replace failed nodes quickly.

Potential PoR Actions:

Review the current distributed systems architecture and ensure it is designed with reliability and fault-tolerance in mind.
Establish more rigorous monitoring and alerting on the data pipeline's performance to detect issues early and take corrective action quickly.
Implement redundancy in critical components of the data pipeline to prevent a single point of failure.
Establish a process for regular review and testing of the data pipeline's resilience to identify and address potential weaknesses before they become a problem.
Conduct post-incident reviews to identify further areas of improvement and ensure that lessons learned are incorporated into future incident response plans.
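
The monitoring and alerting action above could start with something as small as a sustained-backlog check. The following is a sketch under stated assumptions: the `should_alert` function, the threshold of 1000 items, and the three-sample window are hypothetical choices for illustration, not values from the incident.

```python
def should_alert(backlog_samples, threshold, consecutive):
    """Return True when the most recent `consecutive` samples all exceed threshold."""
    if consecutive <= 0 or len(backlog_samples) < consecutive:
        return False
    return all(sample > threshold for sample in backlog_samples[-consecutive:])


# Backlog sizes sampled at a fixed interval (hypothetical values).
sustained = should_alert([10, 1200, 1500, 1800], threshold=1000, consecutive=3)
transient = should_alert([10, 1200, 900, 1800], threshold=1000, consecutive=3)
```

Requiring several consecutive breaches trades a little detection delay for fewer false alarms from short spikes; the right window depends on how fast a backlog can grow in the pipeline being watched.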

Note: The information presented in this article is based on publicly available sources; its completeness and correctness are not guaranteed.
