Unraveling the ChatGPT Outage: Insights from OpenAI's Post-Mortem

This post-mortem analyzes the root cause of the widespread service disruption that affected all OpenAI services on December 11, 2024.

Source: https://status.openai.com/incidents/ctrsv3lwd797

Incident Overview

All OpenAI services experienced a significant outage on December 11, 2024, beginning at 3:16 PM PST and lasting until 7:38 PM PST.

OpenAI operates hundreds of Kubernetes clusters globally. Kubernetes has a control plane responsible for cluster administration and a data plane that serves workloads.
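
To make that distinction concrete, here is a minimal sketch (not OpenAI's code) contrasting the two planes, assuming the official kubernetes Python client and a hypothetical in-cluster service named payments:

```python
# Minimal sketch, assuming the official `kubernetes` Python client; the
# service name "payments" is hypothetical.
import socket

from kubernetes import client, config


def control_plane_call() -> int:
    """Control plane: cluster administration goes through the Kubernetes API server."""
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    pods = client.CoreV1Api().list_pod_for_all_namespaces(limit=50)
    return len(pods.items)


def data_plane_path() -> str:
    """Data plane: a workload resolves a peer via cluster DNS and sends traffic
    directly to it, without going through the API server."""
    return socket.gethostbyname("payments.default.svc.cluster.local")
```

Workloads normally ride the data-plane path, but DNS-based service discovery is ultimately fed by the control plane, which is why a control-plane failure eventually dragged the data plane down with it.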

The outage was traced to an internal change: at 3:12 PM PST, a new telemetry service was deployed to collect detailed Kubernetes control plane metrics, part of an effort to improve observability across OpenAI's Kubernetes clusters. The deployment inadvertently overwhelmed the Kubernetes control plane, leading to cascading failures across critical systems.
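
The exact API calls are not published, but the failure class is easy to illustrate: an agent scheduled on every node that issues an expensive, cluster-wide request at startup makes control-plane load scale with node count and arrive all at once during rollout. A rough sketch, with the kubernetes Python client and the specific LIST call assumed purely for illustration:

```python
# Illustrative only -- not OpenAI's telemetry code.  Models a per-node agent
# whose startup work is an unpaginated, cluster-wide LIST against the API
# server, one of the most expensive read patterns it serves.
from kubernetes import client, config


def collect_startup_metrics() -> dict[str, int]:
    config.load_incluster_config()  # every agent pod talks to the API server
    v1 = client.CoreV1Api()

    # With thousands of nodes each running this agent, a DaemonSet rollout
    # turns this single call into thousands of simultaneous heavy requests.
    all_pods = v1.list_pod_for_all_namespaces(watch=False)

    pods_per_node: dict[str, int] = {}
    for pod in all_pods.items:
        node = pod.spec.node_name or "unscheduled"
        pods_per_node[node] = pods_per_node.get(node, 0) + 1
    return pods_per_node
```

A gentler pattern would scope each agent to its own node (for example with a field selector such as spec.nodeName=<node>) and paginate or jitter the requests, which is exactly the kind of behavior that phased rollouts and rate limits are meant to catch before it reaches large clusters.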

Root Causes

1. The configuration of the new telemetry service triggered simultaneous, resource-heavy Kubernetes API operations across thousands of nodes. This unexpected surge in demand overloaded the Kubernetes API servers, causing the control plane to fail in most large clusters.

2. The change had been tested in a staging cluster without issue. However, the failure only manifested in clusters beyond a certain size, a scale the testing environment did not replicate.

3. DNS caching on each node delayed the detection of failures, allowing the rollout to proceed before the full extent of the problem was recognized. Once the 20-minute DNS cache expired, services began to fail because they depend on real-time DNS resolution for service discovery (see the sketch after this list).
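
Here is a minimal sketch of why that cache both hid and then amplified the failure; the cache layout is assumed for illustration, and only the 20-minute window comes from the report:

```python
# Minimal sketch, not OpenAI's code: a 20-minute DNS cache keeps answering from
# stale entries while the backing resolver (cluster DNS, fed by the control
# plane) is already unhealthy, then lookups start failing in bulk once entries
# expire and a live resolution is required.
import socket
import time

CACHE_TTL_SECONDS = 20 * 60  # the cache window cited in the post-mortem
_cache: dict[str, tuple[str, float]] = {}  # name -> (address, time cached)


def resolve(name: str) -> str:
    entry = _cache.get(name)
    if entry and time.monotonic() - entry[1] < CACHE_TTL_SECONDS:
        return entry[0]  # stale-but-cached answer masks the outage

    # Cache miss or expired entry: a real lookup is unavoidable, and this is
    # the moment the failure becomes visible to the service.
    address = socket.gethostbyname(name)
    _cache[name] = (address, time.monotonic())
    return address
```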

Remediation Efforts

Upon identifying the issue, OpenAI initiated multiple workstreams to restore service quickly. These included scaling down cluster sizes to reduce API load, blocking network access to Kubernetes admin APIs to prevent further requests, and scaling up API servers to handle the load. These efforts allowed for the removal of the problematic telemetry service, leading to a gradual recovery of the affected clusters.

Preventive Measures

To prevent similar incidents in the future, OpenAI is implementing several measures:

1. Robust Phased Rollouts: Improved monitoring and phased rollout processes for infrastructure changes, so failures are detected early and contained.

2. Fault Injection Testing: Running tests to ensure the Kubernetes data plane can keep operating independently of the control plane, and to surface bad changes before they ship.

3. Emergency Access Protocols: Establishing mechanisms to ensure engineers can still reach the Kubernetes API server during high-pressure situations.

4. Decoupling Data and Control Planes: Investing in systems that reduce the data plane's dependence on the control plane for service discovery (a minimal sketch of this idea follows the list).

5. Faster Recovery Strategies: Implementing caching and dynamic rate limiters to facilitate a quicker recovery.
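
Measures 4 and 5 point in the same direction: the data plane should keep working from cached, last-known-good state when the control plane is unhealthy. A minimal sketch of that idea (an assumed design using the kubernetes Python client, not OpenAI's implementation):

```python
# Sketch of the idea behind measures 4 and 5 (assumed design, not OpenAI's):
# keep a last-known-good copy of service endpoints so the data plane can keep
# routing traffic even when the control plane / API server is unreachable.
import time

from kubernetes import client, config


class EndpointCache:
    def __init__(self, refresh_seconds: float = 30.0) -> None:
        config.load_incluster_config()
        self._v1 = client.CoreV1Api()
        self._refresh_seconds = refresh_seconds
        self._last_good: dict[str, list[str]] = {}
        self._last_refresh = 0.0

    def addresses(self, namespace: str, service: str) -> list[str]:
        now = time.monotonic()
        if now - self._last_refresh >= self._refresh_seconds:
            try:
                eps = self._v1.read_namespaced_endpoints(service, namespace)
                addrs = [
                    addr.ip
                    for subset in (eps.subsets or [])
                    for addr in (subset.addresses or [])
                ]
                self._last_good[f"{namespace}/{service}"] = addrs
                self._last_refresh = now
            except Exception:
                # Control plane unavailable: fall back to the last-known-good
                # answer instead of failing the data path.
                pass
        return self._last_good.get(f"{namespace}/{service}", [])
```

The point of the fallback is that a control-plane outage then degrades freshness rather than availability: routing keeps using slightly stale endpoints instead of failing outright.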
