An engineering deep-dive into Atlassian's Mega Outage of April 2022

In April 2022, Atlassian suffered a major outage in which they "permanently" deleted the data of 400 of their paying cloud customers, and it will take them weeks to recover that data. Let's dissect the outage and understand its nuances.

Disclaimer: I do not have any insider information and the views are pure speculation.

Insight 1: Data loss up to 5 minutes

Some customers reported losing up to 5 minutes of data from just before the incident, which suggests that the persistent backup is taken incrementally every 5 minutes. Such backups typically happen through Change Data Capture (CDC), which operates directly on top of the database.
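To make the 5-minute window concrete, here is a minimal sketch of how an incremental backup interval bounds data loss. The `Change` record, the helper names, and the fixed 300-second interval are all assumptions for illustration; this is not Atlassian's actual backup code.

```python
from dataclasses import dataclass

BACKUP_INTERVAL_S = 5 * 60  # assumed interval, inferred from the reported data loss


@dataclass
class Change:
    ts: int      # seconds since an arbitrary epoch
    row_id: int
    value: str


def last_backup_time(incident_ts: int, interval: int = BACKUP_INTERVAL_S) -> int:
    """Timestamp of the most recent incremental backup before the incident."""
    return (incident_ts // interval) * interval


def split_changes(changes, incident_ts, interval=BACKUP_INTERVAL_S):
    """Partition changes into those already backed up and those lost."""
    cutoff = last_backup_time(incident_ts, interval)
    backed_up = [c for c in changes if c.ts < cutoff]
    lost = [c for c in changes if cutoff <= c.ts <= incident_ts]
    return backed_up, lost


changes = [Change(10, 1, "a"), Change(290, 2, "b"), Change(310, 3, "c"), Change(590, 4, "d")]
backed_up, lost = split_changes(changes, incident_ts=595)
print([c.row_id for c in backed_up], [c.row_id for c in lost])
```

Anything written after the last completed increment is gone, so the worst-case loss is just under one interval.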

Insight 2: Rolling release of products

Atlassian rolls out features to a subset of its users first and then incrementally rolls them out to the rest. This incremental rollout gives companies and teams a chance to test the waters on a small subset before exposing everyone to the change.
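One common way to implement such a rollout is to hash each tenant into a stable bucket and enable the feature for buckets below a percentage. The sketch below assumes this approach; the function names and the feature name are hypothetical, and nothing here is confirmed to be how Atlassian does it.

```python
import hashlib


def rollout_bucket(tenant_id: str, feature: str) -> int:
    """Map a tenant to a stable bucket in [0, 100) for a given feature."""
    digest = hashlib.sha256(f"{feature}:{tenant_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(tenant_id: str, feature: str, rollout_percent: int) -> bool:
    """A tenant sees the feature once the rollout percentage passes its bucket."""
    return rollout_bucket(tenant_id, feature) < rollout_percent


tenants = [f"tenant-{i}" for i in range(1000)]
early = [t for t in tenants if is_enabled(t, "new-editor", 10)]
later = [t for t in tenants if is_enabled(t, "new-editor", 50)]
print(len(early), len(later))
```

Because the bucket is derived from a hash rather than chosen at random per request, a tenant that got the feature at 10% keeps it at 50%, which keeps the rollout strictly incremental.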

Insight 3: Mark vs Permanent Deletion

The script that Atlassian ran to delete the data supported both options: Mark for Deletion and Permanent Deletion.

  • Mark for deletion: a soft delete, i.e. setting is_deleted to true.
  • Permanent deletion: a hard delete, i.e. firing a DELETE query.

Why do companies need permanent deletion? For compliance, because GDPR gives users a Right to be Forgotten.
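The difference between the two modes can be sketched with an in-memory SQLite database. The `sites` table, its columns, and the helper functions are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sites ("
    "id INTEGER PRIMARY KEY, name TEXT, is_deleted INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO sites (id, name) VALUES (?, ?)",
    [(1, "alpha"), (2, "beta"), (3, "gamma")],
)


def soft_delete(conn, site_id):
    """Mark for deletion: the row stays, only the flag changes."""
    conn.execute("UPDATE sites SET is_deleted = 1 WHERE id = ?", (site_id,))


def hard_delete(conn, site_id):
    """Permanent deletion: the row is physically removed."""
    conn.execute("DELETE FROM sites WHERE id = ?", (site_id,))


def restore(conn, site_id):
    """Undo a soft delete; a no-op if the row no longer exists."""
    conn.execute("UPDATE sites SET is_deleted = 0 WHERE id = ?", (site_id,))


soft_delete(conn, 1)
hard_delete(conn, 2)
restore(conn, 1)  # works: the row was only flagged
restore(conn, 2)  # does nothing: the row is gone
remaining = [r[0] for r in conn.execute(
    "SELECT id FROM sites WHERE is_deleted = 0 ORDER BY id")]
print(remaining)
```

A soft delete can be undone with a single UPDATE, whereas a hard delete can only be undone from a backup.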

Insight 4: Synchronous Replication

To maintain high availability, they run synchronous standby replicas, which means a write needs to succeed on both databases before it is acknowledged back to the user. This ensures that acknowledged data survives the crash of a single database.
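The acknowledgement rule can be sketched as follows. The `Replica` and `SyncReplicatedDB` classes are hypothetical, and the rollback is a deliberate simplification; real databases coordinate this through their write-ahead log rather than by popping entries.

```python
class Replica:
    """A toy database node that appends records to an in-memory log."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.log = []

    def append(self, record):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unreachable")
        self.log.append(record)


class SyncReplicatedDB:
    """Acknowledge a write only after primary AND standby have applied it."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby

    def write(self, record) -> bool:
        applied = []
        try:
            for node in (self.primary, self.standby):
                node.append(record)
                applied.append(node)
        except ConnectionError:
            # Simplified rollback: undo the write on nodes that took it.
            for node in applied:
                node.log.pop()
            return False  # not acknowledged to the user
        return True  # acknowledged: both copies hold the record


db = SyncReplicatedDB(Replica("primary"), Replica("standby"))
ok = db.write("issue-1 created")
db.standby.healthy = False
failed = db.write("issue-2 created")
print(ok, failed, db.primary.log)
```

The second write is rejected rather than acknowledged with only one copy, which is the trade-off synchronous replication makes: durability over write availability.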

Insight 5: Immutable Backups

The backup is made immutable and stored on S3 in some serialized format. These immutable backups allow Atlassian to recover the data as of any point in time while remaining very cost-efficient.
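A rough sketch of the idea, using an in-memory stand-in for the object store: backups are written once under their timestamp and never overwritten, and a restore picks the latest backup at or before the requested time. The `ImmutableBackupStore` class and the JSON serialization are assumptions for illustration, not the actual format Atlassian uses.

```python
import bisect
import json


class ImmutableBackupStore:
    """Write-once backups keyed by timestamp (an in-memory stand-in for S3)."""

    def __init__(self):
        self._timestamps = []  # kept sorted for binary search
        self._blobs = {}

    def put(self, ts: int, state: dict) -> None:
        if ts in self._blobs:
            raise ValueError("backups are immutable; refusing to overwrite")
        bisect.insort(self._timestamps, ts)
        self._blobs[ts] = json.dumps(state, sort_keys=True)

    def restore(self, target_ts: int) -> dict:
        """Return the state from the latest backup at or before target_ts."""
        idx = bisect.bisect_right(self._timestamps, target_ts)
        if idx == 0:
            raise LookupError("no backup at or before the requested time")
        return json.loads(self._blobs[self._timestamps[idx - 1]])


store = ImmutableBackupStore()
store.put(300, {"issue-1": "open"})
store.put(600, {"issue-1": "closed"})
print(store.restore(450))
```

Because no backup is ever modified, a bad write or delete in the live database cannot corrupt an earlier restore point.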

Insight 6: Their architecture is not truly a Multi-tenant Architecture

In a true multi-tenant architecture with strict per-tenant isolation, every customer gets its own slice of the infra, right from the DB to the brokers to the servers. But at Atlassian, multiple customers share the same infra components. Companies typically do this to cut down on their infrastructure cost.

Why is it taking a long time to restore?

Because the data of multiple customers resides in the same database, the data (rows and tables) was backed up as is; every backup therefore contains data from multiple customers.

To restore the intermingled rows of one customer, the entire backup needs to be loaded into a database, and then only the rows of that specific customer need to be extracted and restored. This process is extremely time-consuming.
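The restore path described above can be sketched with two in-memory SQLite databases: one standing in for the fully loaded shared backup and one for the live system. The `issues` table, the `tenant_id` column, and the tenant names are hypothetical.

```python
import sqlite3


def make_db(rows):
    """Create a toy database whose single table mixes rows of many tenants."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE issues (id INTEGER PRIMARY KEY, tenant_id TEXT, title TEXT)"
    )
    conn.executemany("INSERT INTO issues VALUES (?, ?, ?)", rows)
    return conn


def restore_tenant(backup, live, tenant_id):
    """Copy only one tenant's rows from the loaded backup into the live DB."""
    rows = backup.execute(
        "SELECT id, tenant_id, title FROM issues WHERE tenant_id = ?",
        (tenant_id,),
    ).fetchall()
    # OR IGNORE leaves any row that already exists in the live DB untouched.
    live.executemany("INSERT OR IGNORE INTO issues VALUES (?, ?, ?)", rows)
    return len(rows)


# The whole shared backup must be loaded before any single tenant can be extracted.
backup = make_db([(1, "acme", "Login bug"), (2, "globex", "Slow page"), (3, "acme", "Crash")])
# The live DB lost acme's rows, while globex kept working and edited its data.
live = make_db([(2, "globex", "Slow page (edited)")])

restored = restore_tenant(backup, live, "acme")
print(restored, live.execute("SELECT id, title FROM issues ORDER BY id").fetchall())
```

Restoring the backup wholesale would overwrite the newer data of unaffected customers, which is why the rows have to be filtered out tenant by tenant, and with many tables per tenant this becomes slow.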

Here's the video of me explaining this in depth; do check it out.

In April 2022, Atlassian suffered a major outage in which they "permanently" deleted the data of 400 of their paying cloud customers, and it will take them weeks to recover that data. In this video, we do an engineering deep dive into the outage to understand their engineering systems and practices.

We extract 6 key insights into how their engineering systems are built, their backup and restoration strategies, and, most importantly, why it is taking them so long to recover the data.

Outline:

  • 00:00 Impact of the outage
  • 03:56 Insight 1: Incremental Backup Strategy
  • 06:38 Why did the Atlassian outage happen?
  • 07:30 Insight 2: Progressive Rollout Strategy
  • 10:57 Insight 3: Soft Deletes vs Hard Deletes
  • 14:28 Insight 4: Synchronous Replication for High Availability
  • 17:47 Insight 5: Immutable backups for point-in-time recovery
  • 21:04 Insight 6: Nearly multi-tenant architecture
  • 23:30 Why is it taking time for Atlassian to recover the deleted data?

Outage Report: https://www.atlassian.com/engineering/april-2022-outage-update

Thank you so much for reading. If you found this helpful, do spread the word about it on social media; it would mean the world to me.

You can also follow me on your favourite social media: LinkedIn and Twitter.

Yours truly,

Arpit

arpitbhayani.me

Until next time, stay awesome :)

I teach a course on System Design where you'll learn how to intuitively design scalable systems. The course will help you

  • become a better engineer
  • ace your technical discussions
  • get acquainted with a massive spectrum of topics, ranging from storage engines and high-throughput systems to the super-clever algorithms behind them.

I have compressed my ~10 years of work experience into this course, and aim to accelerate your engineering growth 100x. To date, the course is trusted by 600+ engineers from 10 different countries and here you can find what they say about the course.

Together, we will build some of the most amazing systems and dissect them to understand the intricate details. You can find the week-by-week curriculum and topics, benefits, testimonials, and other information here https://arpitbhayani.me/masterclass.
