登录查看更多内容

A case of Terabyte scale backup and recovery solution

Chinmay Naik

Founder and CEO at One2N | Building Cloud Native Solutions | Is your business scaling faster than tech can handle? DM me.

发布日期: 2023年9月5日

+ 关注

You are a tech lead who's tasked to backup 1.5TB (yes, not a typo, a terabyte) of data DAILY to S3.

You're not really sure about any other background about this requirement, so you start asking some questions:

Where's the data produced from and stored currently?
What kind of data (format) is it really? What's the access pattern for this data?
What data retention do we need?
Why do we need a backup in the first place? Why S3? Is that already decided?
Are we already on AWS?
Can someone share some business context with me?
What are data restoration parameters - e.g., time to restore, where to store the restored data?
How often does the backup need to run? What about the restoration process? How often do we need to run it?

You have many other questions, but you choose to start with these...

You gather answers to these and more:

Context

The company is in the Cybersecurity domain and provides its clients a "Managed Security Services" offering. The data is generated by its multi-tenant SIEM platform deployed on private data centers. The SIEM handles 100s of tenants and produces 1.5 TB of log data daily (tar.gz format). The infra consists of 24 beefy machines in Data Centers (DC). There's a private link between the DC and AWS. Every night, a cron job creates tar.gz files on these 24 machines.

You find out some more details.

The average tar.gz file size per machine is 65GB. (24 machines * ~65GB avg tar.gz file => 1.5TB data daily). Each machine handles a few tenants, and the tar.gz file contains data for these tenants. Currently, tar.gz files are moved to a 200TB archival storage. The current archival process needs fixing. The data needs to be kept for 1 yr for compliance purposes (total data per year could be 700TB, given the growth of the company). There's also a need to re-arrange the data in a specific dir structure before gzipping.

With this, you start designing a high-level solution that will be:

automated
secure
performant
operationally simple to run and
budget-friendly

Exploring the Solution

You also understand more about the SIEM and how it stores this huge data.

You perform a POC to find out whether you can hook into the SIEM to create the required dir structure upfront. This will save a lot of disk IO to read the data, transform it, and write it again to the appropriate format.

With some trial and error, POC is successful. ?

Since the data is seldom accessed and only used for compliance purposes, you choose the S3 Glacier Deep Archive storage class. This is the cheapest and most reliable way to store the data at this scale.

The challenge is uploading this huge data from the DC to S3 daily.

You test the available bandwidth by uploading large files to S3. This impacts the production network. Luckily, a backup network pipe can be used without affecting the production traffic. ???

You find out that the 24 VMs are beefy, but they are already serving prod traffic. Running the backup workload causes high CPU utilization and affects prod traffic. So you decide to run the tar.gz operation on a separate central server (in non-prod network)

Devendra Goyal 1 个月前

How Cohesity Revolutionizes Data Backup and Recovery…

Network Techlab (I) Pvt. Ltd 4 个月前

Snapshots Vs Replication Vs Backups

Kuldeep Saxena 3 年前

With these and a few more small POCs in place, you propose this flow for the whole solution.

Solution

create a backup dir structure on each of the 24 VMs via cron job
scp?these files to NFS
create tar.gz files on a central server
upload tar.gz to S3

Tech-wise, you choose something simple and boring - bash scripts and cron jobs.

There's some pushback about using bash to do all this work, but you're sure it can work well (and be maintainable).

This is your overall solution:

It's just three bash scripts overall, with some error handling and success/failure notifications. You orchestrate the execution of these scripts such that they work like a data pipeline:

scp
tar and
s3 upload

All these operations run in parallel as much as possible. And it works beautifully on production for months without much oversight! You encounter some edge cases during your beta testing, but it's nothing you can't handle at this point.?

The entire solution costs less than a few hundred dollars per month to run.

You cross-check your solution goals:

automated ? (bash and cron ftw!)
secure ? (runs in separate VPC)
performant ? (yesterday's data is uploaded within 10 hours, during non-peak time)
operationally simple to run ?
budget-friendly ? (less than a few hundred $ monthly)

Lessons

Simple, boring tech works well on prod
When faced with unknowns, form and validate your hypothesis by doing POCs
Understand existing tech and context as much as you can
Build things iteratively instead of in a big-bang way
Be solution-focused, not tech-focused

I write such stories on software engineering. There's no specific frequency, as I don't make up these. If you liked this one, you might love - https://www.dhirubhai.net/pulse/curious-case-slow-apis-chinmay-naik/

Follow me - Chinmay Naik for more such stuff.

Chinmay Bhoir

MLX Tech @ Capital One

1 年

This reminded me of a talk by Brian Cantrill where he demonstrated the usecase of grep and pipes for a TB scale data. Simple, maintainable, and scalable.

2 次回应

Satish Sutar

Building Resilient Systems with Cloud and DevOps

1 年

Insightful share.

1 次回应

Sumit Jain

Startup Enthusiast & 0->1 Architect | Turned Vision into Reality with a Successful Acquisition ?? | Featured in Forbes & CNBC Young Turks ?? | TEDx Speaker

1 年

Really like the case story.. would love to see some more similar practical case studies. I am surely gonna discuss it with team in our weekly tech talks. Thanks Chinmay Naik

2 次回应

Ratnadeep Deshmane

Founder @BetaCraft | Software Craftsman | Ruby on Rails Developer

1 年

This made a very nice morning read, thanks!

2 次回应

Chandraprakash Sarathe

1 年

How 10hrs broken down from step 1 to 4 ?

2 次回应

查看更多评论

要查看或添加评论，请登录

Chinmay Naik的更多文章

Data engineering mystery - rerouting large data in Kafka

2024年6月4日

Data engineering mystery - rerouting large data in Kafka

You're a tech lead handling a large-scale data pipeline. One day, your colleague (C) pulls you into an issue related to…
Vanishing Acts: The Mystery of Failing Database Writes

2024年3月29日

Vanishing Acts: The Mystery of Failing Database Writes

As an SRE at a growth-stage company, you're on call this week. A PagerDuty alert wakes you up.

16 条评论
A story about a nightmare scenario for every SRE

2024年2月21日

A story about a nightmare scenario for every SRE

Story time. It's about cloud security failures and why good engineering practices matter, especially during the One to…
Curious case of debugging failing webhook API requests

2024年1月9日

Curious case of debugging failing webhook API requests

A short debugging story to start off the new year. Developer (D): Hey, can you join a call? I need some help in…

1 条评论
Migrating Terabytes of metrics data with zero downtime

2023年12月27日

Migrating Terabytes of metrics data with zero downtime

You're an SRE responsible for VictoriaMetrics deployment with 30 Million time series/min. The CTO wants you to…

2 条评论
A curious story of debugging Machine Learning models

2023年11月29日

A curious story of debugging Machine Learning models

You're woken up by a p90 latency-related alert. This alert is for the main API service, so you start investigating…

3 条评论
Database reliability - Migrating Terabyte from self-hosted MySQL to GCP CloudSQL

2023年11月8日

Database reliability - Migrating Terabyte from self-hosted MySQL to GCP CloudSQL

You're a lead SRE and CTO asks you to manage and scale a self-managed 6-node MySQL cluster with 1.5+ TB data on…
Database reliability - zero downtime schema migrations with MySQL

2023年10月24日

Database reliability - zero downtime schema migrations with MySQL

(Database reliability story 1) You join a team as lead SRE, and the CTO asks you to manage and scale a self-managed…

12 条评论
Building Pull Request based ephemeral Preview environments on Kubernetes

2023年10月9日

Building Pull Request based ephemeral Preview environments on Kubernetes

A CTO of a company calls you. They just migrated from Heroku to AWS on EKS.

3 条评论
Taming GCP networking cloud costs

2023年9月26日

Taming GCP networking cloud costs

Here's a story of a pragmatic tech lead who understands networking fundamentals like iptables packet routing and NAT…

3 条评论

See all articles

A case of Terabyte scale backup and recovery solution

Chinmay Naik

Founder and CEO at One2N | Building Cloud Native Solutions | Is your business scaling faster than tech can handle? DM me.

Context

Exploring the Solution

领英推荐

Solution

Lessons

Chinmay Naik的更多文章

社区洞察

其他会员也浏览了

Snapshots Vs Replication Vs Backups

Navigating the Skies: Multi-Cloud Strategies for Defense and Intelligence

Data Backup and Recovery Market - Forecast(2024 - 2030)

Big Data Security Market: An Attractive Long-Term Opportunity| Microsoft, Oracle, DataVisor

Snapshots Vs Replication Vs Backups

VERITAS-DATA BACKUP IN STYLE

RAID 0 Data Recovery - A Comprehensive Guide

Ensuring Data Security and Compliance with Azure Data Services

May 04, 2022

Storage and Data Protection News for the Month of February 24; Updates from Arcserve, Druva, HYCU & More

Context

Exploring the Solution

领英推荐

Solution

Lessons

Chinmay Naik的更多文章

Data engineering mystery - rerouting large data in Kafka

Vanishing Acts: The Mystery of Failing Database Writes

A story about a nightmare scenario for every SRE

Curious case of debugging failing webhook API requests

Migrating Terabytes of metrics data with zero downtime

A curious story of debugging Machine Learning models

Database reliability - Migrating Terabyte from self-hosted MySQL to GCP CloudSQL

Database reliability - zero downtime schema migrations with MySQL

Building Pull Request based ephemeral Preview environments on Kubernetes

Taming GCP networking cloud costs

社区洞察

其他会员也浏览了

Snapshots Vs Replication Vs Backups

Navigating the Skies: Multi-Cloud Strategies for Defense and Intelligence

Data Backup and Recovery Market - Forecast(2024 - 2030)

Big Data Security Market: An Attractive Long-Term Opportunity| Microsoft, Oracle, DataVisor

Snapshots Vs Replication Vs Backups

VERITAS-DATA BACKUP IN STYLE

RAID 0 Data Recovery - A Comprehensive Guide

Ensuring Data Security and Compliance with Azure Data Services

May 04, 2022

Storage and Data Protection News for the Month of February 24; Updates from Arcserve, Druva, HYCU & More