
An engineering deep-dive into Atlassian's Mega Outage of April 2022

Arpit Bhayani

Published: May 21, 2022

In April 2022, Atlassian suffered a major outage in which they "permanently" deleted the data of 400 of their paying cloud customers, and it will take them weeks to recover it. Let's dissect the outage and understand its nuances.

Disclaimer: I do not have any insider information and the views are pure speculation.

Insight 1: Data loss up to 5 minutes

Some customers reported losing up to 5 minutes of data from just before the incident, which suggests that the persistent backup is taken incrementally every 5 minutes. Such backups typically happen through Change Data Capture (CDC), which operates directly on top of the database.
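To make the 5-minute window concrete, here is a minimal sketch of an incremental backup that buffers change events and flushes them in batches. The class names, the in-memory "durable" list, and the exact flush rule are assumptions for illustration, not Atlassian's actual pipeline; the only figure taken from the article is the 5-minute interval.

```python
from dataclasses import dataclass, field

# Assumed flush interval, inferred from the reported 5-minute data loss.
BACKUP_INTERVAL_SECONDS = 5 * 60


@dataclass
class ChangeEvent:
    timestamp: float  # seconds since an arbitrary epoch
    key: str
    value: str


@dataclass
class IncrementalBackup:
    """Buffers captured changes and flushes them to durable storage in batches."""
    interval: float = BACKUP_INTERVAL_SECONDS
    last_flush: float = 0.0
    pending: list = field(default_factory=list)   # captured, not yet backed up
    durable: list = field(default_factory=list)   # stands in for the backup store

    def capture(self, event: ChangeEvent) -> None:
        # Flush first if the interval has elapsed, then buffer the new change.
        if event.timestamp - self.last_flush >= self.interval:
            self.flush(event.timestamp)
        self.pending.append(event)

    def flush(self, now: float) -> None:
        self.durable.extend(self.pending)
        self.pending.clear()
        self.last_flush = now

    def lost_on_incident(self) -> list:
        # Changes still buffered when the incident hits never reached the backup.
        return list(self.pending)


backup = IncrementalBackup()
for t, key in [(10, "a"), (200, "b"), (310, "c"), (450, "d")]:
    backup.capture(ChangeEvent(t, key, "v"))

print([e.key for e in backup.durable])             # ['a', 'b']
print([e.key for e in backup.lost_on_incident()])  # ['c', 'd']
```

Anything captured after the last flush is exposed, so the worst-case loss is bounded by the flush interval.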

Insight 2: Rolling release of products

Atlassian rolls out features to a subset of the users and then incrementally rolls them to others. This strategy of incremental rollout gives companies and teams a chance to test the waters on a subset and then roll out to the rest.
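One common way to implement such a rollout is to hash each customer into a stable bucket and enable the feature for buckets below the current percentage. The sketch below assumes this approach; the function names and the 100-bucket scheme are hypothetical and not taken from Atlassian's report.

```python
import hashlib


def rollout_bucket(customer_id: str, feature: str) -> int:
    """Map a customer to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(f"{feature}:{customer_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(customer_id: str, feature: str, rollout_percent: int) -> bool:
    """A customer sees the feature once the rollout percentage passes their bucket."""
    return rollout_bucket(customer_id, feature) < rollout_percent


# Raising the percentage only ever adds customers; nobody flips back and forth.
for percent in (0, 10, 50, 100):
    enabled = [c for c in ("acme", "globex", "initech") if is_enabled(c, "new-editor", percent)]
    print(percent, enabled)
```

Because the bucket depends only on the customer and the feature name, a customer who got the feature at 10% still has it at 50%.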

Insight 3: Mark vs Permanent Deletion

The script that Atlassian ran to delete the data had both options: mark for deletion and permanent deletion.

Mark for deletion is a soft delete, i.e. setting is_deleted to true. Permanent deletion is a hard delete, i.e. firing a DELETE query.

Why do companies need permanent deletion? For compliance, because GDPR gives users a Right to be Forgotten.
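The difference between the two modes is easy to see in a small sketch using Python's built-in sqlite3 module. The `sites` table and the helper names are assumptions for illustration; only the `is_deleted` flag and the DELETE query come from the description above.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with two hypothetical sites."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sites ("
        "id INTEGER PRIMARY KEY, name TEXT, is_deleted INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany("INSERT INTO sites (id, name) VALUES (?, ?)", [(1, "alpha"), (2, "beta")])
    return conn


def soft_delete(conn: sqlite3.Connection, site_id: int) -> None:
    # Mark for deletion: the row stays in the table and can be brought back.
    conn.execute("UPDATE sites SET is_deleted = 1 WHERE id = ?", (site_id,))


def hard_delete(conn: sqlite3.Connection, site_id: int) -> None:
    # Permanent deletion: the row is gone from the live database.
    conn.execute("DELETE FROM sites WHERE id = ?", (site_id,))


def restore(conn: sqlite3.Connection, site_id: int) -> bool:
    # Undo only works if the row still exists, i.e. it was soft deleted.
    cur = conn.execute(
        "UPDATE sites SET is_deleted = 0 WHERE id = ? AND is_deleted = 1", (site_id,)
    )
    return cur.rowcount == 1


conn = make_db()
soft_delete(conn, 1)
hard_delete(conn, 2)
print(restore(conn, 1))  # True: the soft-deleted row is recoverable in place
print(restore(conn, 2))  # False: the hard-deleted row can only come back from a backup
```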

Insight 4: Synchronous Replication

To maintain high availability, they have synchronous standby replicas, which means that a write needs to succeed on both databases before it is acknowledged back to the user. This ensures that acknowledged data survives the crash of a single database node.
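The acknowledgement rule can be sketched with two in-memory replicas. This is a toy model under stated assumptions (the `Replica` and `SyncReplicatedStore` classes are hypothetical, and real databases ship log records rather than key-value pairs), but it shows why a write is only acknowledged once both copies have it.

```python
_MISSING = object()  # sentinel meaning "the key had no previous value"


class Replica:
    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy
        self.data = {}

    def apply(self, key: str, value: str) -> bool:
        if not self.healthy:
            return False
        self.data[key] = value
        return True


class SyncReplicatedStore:
    def __init__(self, primary: Replica, standby: Replica) -> None:
        self.primary = primary
        self.standby = standby

    def write(self, key: str, value: str) -> bool:
        """Acknowledge (return True) only if both replicas applied the write."""
        previous = self.primary.data.get(key, _MISSING)
        if not self.primary.apply(key, value):
            return False
        if not self.standby.apply(key, value):
            # Undo the primary write so the two copies do not diverge.
            if previous is _MISSING:
                self.primary.data.pop(key, None)
            else:
                self.primary.data[key] = previous
            return False
        return True


store = SyncReplicatedStore(Replica("primary"), Replica("standby"))
print(store.write("issue-1", "open"))    # True: both copies have the write
store.standby.healthy = False
print(store.write("issue-2", "open"))    # False: not acknowledged to the user
```

Note that a delete is replicated just as faithfully as any other write, so a synchronous standby guards against a node crash rather than against an erroneous deletion.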

Insight 5: Immutable Backups

The backup is made immutable and stored on S3 in some serialized format. This immutable backup allows Atlassian to recover data at any point in time while being super cost-efficient at the same time.
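Point-in-time recovery from immutable backups usually boils down to taking a base snapshot and replaying the recorded changes up to the chosen moment. The sketch below assumes that model; the tuple layout `(timestamp, key, value)` and the use of `None` to represent a delete are illustrative choices, not the actual serialized format.

```python
def restore_to_point_in_time(snapshot: dict, change_log: list, target_ts: float) -> dict:
    """Rebuild state as of target_ts from a base snapshot plus a change log.

    change_log holds (timestamp, key, value) tuples; value None means "deleted".
    Neither the snapshot nor the log is modified, mirroring immutable backups.
    """
    state = dict(snapshot)
    for ts, key, value in sorted(change_log, key=lambda change: change[0]):
        if ts > target_ts:
            break  # everything after the target moment is ignored
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state


snapshot = {"a": "1"}
log = [(10, "b", "2"), (20, "a", None), (30, "c", "3")]

print(restore_to_point_in_time(snapshot, log, 15))  # {'a': '1', 'b': '2'}
print(restore_to_point_in_time(snapshot, log, 25))  # {'b': '2'}
```

Because the inputs are never mutated, the same backup can be replayed to any number of different points in time.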

Insight 6: Their architecture is not truly a Multi-tenant Architecture

In a true multi-tenant architecture, every customer gets its own fragment of infra, right from the DB to brokers to servers. But at Atlassian, multiple customers share the same infra components. Companies typically do this to cut down on their infrastructure cost.

Why is it taking a long time to restore?

Because the data of multiple customers resides in the same database, when the DB was backed up the data (rows and tables) was backed up as is, implying that each backup also holds data from multiple customers.

Now to restore the intermingled rows of a customer, the entire backup needs to be loaded into a database and then the rows of specific customers need to be restored. This process is extremely time-consuming.
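The restore steps described above can be sketched as follows, again with sqlite3. The `issues` table, the `tenant_id` column, and the `restore_tenant` helper are assumptions for illustration; the point is that the whole shared backup has to be loaded before one customer's rows can be picked out.

```python
import sqlite3

SCHEMA = "CREATE TABLE issues (tenant_id TEXT, issue_id INTEGER, title TEXT)"


def restore_tenant(backup_rows: list, live: sqlite3.Connection, tenant_id: str) -> int:
    """Restore one tenant's rows from a backup that contains every tenant's rows."""
    # Step 1: load the ENTIRE backup into a staging database.
    staging = sqlite3.connect(":memory:")
    staging.execute(SCHEMA)
    staging.executemany("INSERT INTO issues VALUES (?, ?, ?)", backup_rows)

    # Step 2: extract only the rows that belong to the affected tenant.
    rows = staging.execute(
        "SELECT tenant_id, issue_id, title FROM issues WHERE tenant_id = ?",
        (tenant_id,),
    ).fetchall()
    staging.close()

    # Step 3: copy those rows into the live database, leaving other tenants untouched.
    live.executemany("INSERT INTO issues VALUES (?, ?, ?)", rows)
    return len(rows)


# The backup mixes two tenants; only tenant "A" was deleted from the live DB.
backup = [("A", 1, "bug"), ("B", 1, "task"), ("A", 2, "story")]
live = sqlite3.connect(":memory:")
live.execute(SCHEMA)
live.execute("INSERT INTO issues VALUES ('B', 1, 'task')")

print(restore_tenant(backup, live, "A"))  # 2
```

The cost of steps 1 and 2 scales with the size of the whole backup, not with the size of the one customer being restored, and it repeats for every affected customer.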

Here's the video of me explaining this in depth; do check it out.


In April 2022, Atlassian suffered a major outage in which they "permanently" deleted the data of 400 of their paying cloud customers, and it will take them weeks to recover it. In this video, we do an engineering deep dive into this outage, trying to understand their engineering systems and practices.

We extract 6 key insights into how their engineering systems are built, their backup and restoration strategies, and most importantly why it is taking them so long to recover the data.


Outline:

00:00 Impact of the outage
03:56 Insight 1: Incremental Backup Strategy
06:38 Why did the Atlassian outage happen?
07:30 Insight 2: Progressive Rollout Strategy
10:57 Insight 3: Soft Deletes vs Hard Deletes
14:28 Insight 4: Synchronous Replication for High Availability
17:47 Insight 5: Immutable backups for point-in-time recovery
21:04 Insight 6: Nearly multi-tenant architecture
23:30 Why is it taking time for Atlassian to recover the deleted data?

Outage Report: https://www.atlassian.com/engineering/april-2022-outage-update

You can also

Subscribe to the YT Channel Asli Engineering
Download the notes
Listen to this on the go on Spotify

Thank you so much for reading. If you found this helpful, do spread the word about it on social media; it would mean the world to me.

You can also follow me on your favourite social media: LinkedIn and Twitter.

Yours truly,

Arpit

arpitbhayani.me

Until next time, stay awesome :)


I teach a course on System Design where you'll learn how to intuitively design scalable systems. The course will help you

become a better engineer
ace your technical discussions
get acquainted with a massive spectrum of topics, ranging from storage engines and high-throughput systems to the super-clever algorithms behind them.

I have compressed my ~10 years of work experience into this course, and aim to accelerate your engineering growth 100x. To date, the course is trusted by 600+ engineers from 10 different countries and here you can find what they say about the course.

Together, we will build some of the most amazing systems and dissect them to understand the intricate details. You can find the week-by-week curriculum and topics, benefits, testimonials, and other information here https://arpitbhayani.me/masterclass.
