A Case of a Terabyte-Scale Backup and Recovery Solution
Chinmay Naik
You are a tech lead tasked with backing up 1.5TB (yes, not a typo, one-and-a-half terabytes) of data DAILY to S3.
You don't have much background on this requirement, so you start asking questions: How is the data generated? Where is it stored today? How often is it accessed? How long must it be retained?
You have many other questions, but you choose to start with these.
You gather answers to these and more:
Context
The company is in the cybersecurity domain and offers its clients a "Managed Security Services" product. The data is generated by its multi-tenant SIEM platform deployed in private data centers. The SIEM handles hundreds of tenants and produces 1.5TB of log data daily (in tar.gz format). The infrastructure consists of 24 beefy machines in the data centers (DC), with a private link between the DC and AWS. Every night, a cron job creates tar.gz files on these 24 machines.
You find out some more details.
- The average tar.gz file per machine is ~65GB (24 machines × ~65GB ≈ 1.5TB daily).
- Each machine handles a few tenants, and its tar.gz file contains data for those tenants.
- Currently, the tar.gz files are moved to a 200TB archival storage, and the current archival process needs fixing.
- The data must be kept for one year for compliance purposes; given the company's growth, the yearly total could reach 700TB.
- The data also needs to be re-arranged into a specific directory structure before gzipping.
With this, you start designing a high-level solution that will be cheap to run, reliable, simple to operate, and safe for production traffic.
Exploring the Solution
You also understand more about the SIEM and how it stores this huge data.
You perform a POC to find out whether you can hook into the SIEM to create the required dir structure upfront. This will save a lot of disk IO to read the data, transform it, and write it again to the appropriate format.
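To make the IO saving concrete, here is a minimal sketch of the idea. The paths, the per-tenant target layout, and the tenant-prefix file-naming convention are all assumptions made up for illustration; hard links let you build the new layout without copying bytes:

```bash
#!/usr/bin/env bash
# Sketch only: paths, naming, and the tenant-prefix convention are assumptions.
set -euo pipefail

SRC="/var/siem/raw"          # where the SIEM writes logs (assumed)
STAGING="/var/siem/staging"  # target layout: <tenant>/<YYYY-MM-DD>/
DATE="$(date +%F)"

# Assume each file name is prefixed with its tenant id, e.g. "acme_fw.log".
for f in "$SRC"/*.log; do
  tenant="$(basename "$f" | cut -d'_' -f1)"
  mkdir -p "$STAGING/$tenant/$DATE"
  # Hard-link instead of copying: no extra disk IO (same filesystem required).
  ln -f "$f" "$STAGING/$tenant/$DATE/$(basename "$f")"
done
```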
With some trial and error, the POC is successful.
Since the data is seldom accessed and only used for compliance purposes, you choose the S3 Glacier Deep Archive storage class. This is the cheapest and most reliable way to store the data at this scale.
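As a minimal sketch of this choice (the bucket and key layout here are hypothetical), the storage class can be set directly at upload time with the AWS CLI:

```bash
# Upload straight into Glacier Deep Archive; bucket and prefix are placeholders.
aws s3 cp host01-2024-01-31.tar.gz \
  s3://example-siem-archive/2024/01/31/host01.tar.gz \
  --storage-class DEEP_ARCHIVE
```

A lifecycle rule that transitions objects after upload would also work, but writing directly to DEEP_ARCHIVE avoids paying for an intermediate storage tier.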
The challenge is uploading this huge data from the DC to S3 daily.
You test the available bandwidth by uploading large files to S3. This impacts the production network. Luckily, a backup network pipe can be used without affecting the production traffic.
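A bandwidth probe can be as crude as timing one large upload. The file size, the bucket name, and the routing over the backup pipe are assumptions in this sketch:

```bash
# Rough throughput probe over the backup link; bucket is a placeholder.
dd if=/dev/urandom of=/tmp/probe.bin bs=1M count=5120   # 5 GiB test file
start=$(date +%s)
aws s3 cp /tmp/probe.bin s3://example-siem-archive/probe.bin
end=$(date +%s)
echo "~$(( 5120 / (end - start) )) MiB/s sustained"
rm -f /tmp/probe.bin
```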
You find out that the 24 VMs are beefy, but they are already serving prod traffic. Running the backup workload causes high CPU utilization and affects prod traffic. So you decide to run the tar.gz operation on a separate central server (in the non-prod network).
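A sketch of that hand-off, assuming the hosts expose their staging directories over SSH (the host names, paths, and bandwidth cap are invented):

```bash
# Pull each host's staged data onto the central server so the CPU-heavy
# tar/gzip work never runs on the production machines.
set -euo pipefail
for h in dc-node-{01..24}; do      # hypothetical host names
  rsync -a --bwlimit=50000 "backup@$h:/var/siem/staging/" "/data/incoming/$h/" &
done
wait   # all 24 pulls run concurrently; --bwlimit (KB/s) protects the link
```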
With these and a few more small POCs in place, you propose this flow for the whole solution.
Solution
Tech-wise, you choose something simple and boring - bash scripts and cron jobs.
There's some pushback about using bash to do all this work, but you're sure it can work well (and be maintainable).
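The entry point can literally be one crontab line (the path and schedule here are hypothetical):

```bash
# Run the nightly pipeline at 01:30 on the central server.
30 1 * * * /opt/backup/run_pipeline.sh >> /var/log/backup.log 2>&1
```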
This is your overall solution:
It's just three bash scripts overall, with some error handling and success/failure notifications. You orchestrate their execution so that they work like a data pipeline: rearrange, compress, and upload.
All these operations run in parallel as much as possible. And it works beautifully in production for months without much oversight! You encounter some edge cases during beta testing, but nothing you can't handle at this point.
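A minimal sketch of such an orchestrator is below. The script names, host naming, parallelism level, and notification webhook are all assumptions; only the rearrange → compress → upload shape comes from the story:

```bash
#!/usr/bin/env bash
# Hypothetical orchestrator: names and the webhook URL are made up.
set -euo pipefail

notify() {
  # Placeholder success/failure notification (e.g. a chat webhook).
  curl -fsS -H 'Content-Type: application/json' \
       -d "{\"text\": \"$1\"}" 'https://hooks.example.com/backup' || true
}

run_pipeline() {
  # One host's data: rearrange -> compress -> upload, stop on first failure.
  local host="$1"
  ./01_rearrange.sh "$host" && ./02_compress.sh "$host" && ./03_upload.sh "$host"
}
export -f run_pipeline

# Process several hosts concurrently; -P bounds the parallelism so the
# central server and the backup network pipe aren't saturated.
if printf '%s\n' dc-node-{01..24} | xargs -P 6 -I{} bash -c 'run_pipeline "$1"' _ {}; then
  notify "backup OK: $(date +%F)"
else
  notify "backup FAILED: $(date +%F)"
  exit 1
fi
```

Bounding parallelism with `xargs -P` is one simple way to process many hosts at once while keeping the load predictable.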
The entire solution costs less than a few hundred dollars per month to run.
You cross-check the solution against your original goals: it's cheap to run, reliable, simple to operate, and it never touches production traffic.
Lessons
- Boring technology (bash and cron) can handle terabyte-scale problems when the pipeline is simple and well-orchestrated.
- Small POCs (the SIEM hook, the bandwidth test) de-risk a design before you commit to it.
- Understand access patterns before choosing storage; seldom-read compliance data belongs in the cheapest durable tier.
I write such stories about software engineering. There's no specific frequency, as I don't make these up. If you liked this one, you might love: https://www.dhirubhai.net/pulse/curious-case-slow-apis-chinmay-naik/
Follow me (Chinmay Naik) for more such stuff.
Comments
MLX Tech @ Capital One (1 yr ago):
This reminded me of a talk by Bryan Cantrill where he demonstrated the use case of grep and pipes for TB-scale data. Simple, maintainable, and scalable.
Building Resilient Systems with Cloud and DevOps (1 yr ago):
Insightful share.
Startup Enthusiast & 0->1 Architect | Turned Vision into Reality with a Successful Acquisition | Featured in Forbes & CNBC Young Turks | TEDx Speaker (1 yr ago):
Really like the case story. Would love to see more similar practical case studies. I'm surely going to discuss it with the team in our weekly tech talks. Thanks Chinmay Naik!
Founder @BetaCraft | Software Craftsman | Ruby on Rails Developer (1 yr ago):
This made for a very nice morning read, thanks!
Software Architect | Lead Engineer | Broadleaf Commerce | B2C | Distributed Systems | Containerization | AWS Cloud (1 yr ago):
How are the 10 hrs broken down from step 1 to step 4?