AWS .. Think Storage .. The Cold .. Glacier
Ahmed Metwally
Cloud Native Architect/ Raising well educated generation of software engineers / AWS UG Egypt leader.
Glacier is object storage .. it’s S3 with a twist .. Glacier shares most of S3’s attributes .. the same durability, availability, scalability, and elasticity levels .. If you haven’t read the S3 article in this series, "aws .. think storage .. the beginning ..s3" .. go read it .. then come back to continue.
In Glacier we create vaults .. not buckets .. you know, buckets are easy to access .. if you have a bucket of money like the one in the next picture .. you just put your hand in the bucket and grab the money .. but vaults are another story .. it takes a process to open the vault and get the money out of it.
Glacier is more secure than S3 .. besides all the S3 security features .. like server-side encryption .. your archive has no URL .. no one can access your archive from the internet .. like they can with S3 objects .. it requires an IAM user to upload or restore the data .. you upload and restore archives via the API .. or .. via S3 object lifecycle management .. since Glacier is also considered a storage class in S3 .. it’s highly integrated with S3.
Glacier has a unique security feature .. the vault lock policy .. using this feature you can prevent authorized IAM users from doing some operations on the vaults .. even you will be prevented .. if you have to meet some compliance or regulatory requirements .. then the vault lock policy is your tool .. let’s say a regulation requires the data to be retained for one year in an archive .. then you just create a lock policy that prevents deletion of this archive for a year .. then Glacier will give you only 24 hours to validate your policy .. during that window you can abort it and start over .. once you complete the lock .. the policy can no longer be changed .. you are no longer in control.
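To make the one-year example concrete .. here is a minimal sketch of what such a lock policy could look like. The vault ARN, account number, and the `Sid` label are made-up placeholders .. the condition key `glacier:ArchiveAgeInDays` and the `glacier:DeleteArchive` action are the ones AWS documents for this kind of retention rule. Treat it as an illustration .. not as a policy to paste into production.

```python
import json


def build_vault_lock_policy(vault_arn: str, retention_days: int) -> str:
    """Return a vault lock policy (as a JSON string) that denies deleting
    any archive younger than `retention_days`."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "deny-delete-before-retention",  # hypothetical label
                "Effect": "Deny",
                "Principal": "*",
                "Action": "glacier:DeleteArchive",
                "Resource": vault_arn,
                "Condition": {
                    # Deny while the archive age is below the retention period.
                    "NumericLessThan": {
                        "glacier:ArchiveAgeInDays": str(retention_days)
                    }
                },
            }
        ],
    }
    return json.dumps(policy)


# Placeholder ARN, for illustration only.
policy_json = build_vault_lock_policy(
    "arn:aws:glacier:us-east-1:123456789012:vaults/example-vault", 365
)
print(policy_json)
```

If you use boto3 .. a string like this would typically be passed to the Glacier client's `initiate_vault_lock` call as `policy={"Policy": policy_json}` .. and the lock is then confirmed with `complete_vault_lock` or cancelled with `abort_vault_lock`.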
To understand why Glacier exists .. and why it’s not just S3 .. we need to talk about some concepts .. backup .. archive .. RPO .. RTO.
Backup means taking a copy of your important files and storing it in a safe place .. a place that allows fast access to and retrieval of those files in the event of an emergency.
Archive means store and forget .. until an event occurs .. and this event should not occur often .. archive is long-term storage .. the data that you archive is data you will not need to access for a long time .. but it’s still important data to keep in a safe place.
When it comes to backup or archive .. you should ask some questions.
The first question is .. how long can we leave new data without backing it up? .. let’s say your system receives 100 new transactions every hour .. how many transactions can you afford to lose? .. if your answer is none .. that means your data should be stored and replicated at the same time it is received .. if your answer is, let’s say, 200 transactions .. that means your data should be backed up every 2 hours .. it’s a trade-off decision .. the cost of replication and backup against the cost of losing the data .. there is no general recommendation for this decision .. you are the one who decides .. and the answer to this question is called the RPO value .. RPO stands for .. recovery point objective .. despite the word “point”, it is usually expressed as a time interval .. 2 hours .. 4 hours .. something like that.
The second question is .. how long can we wait before the data is completely recovered? .. if you can’t wait at all .. then you should depend on a service that allows immediate retrieval of your backup data .. if you can wait .. you should decide for how long .. is it minutes? .. hours? .. a whole day? .. you decide .. the answer to this question is called the RTO value .. RTO stands for .. recovery time objective .. usually it’s a time range .. 1 to 3 hours .. 12 to 24 hours .. something like that.
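The RPO arithmetic above is just a division .. tolerable loss divided by the rate of new data. A tiny sketch .. the function name is mine, and it assumes transactions arrive at a steady rate:

```python
def backup_interval_hours(transactions_per_hour: float, max_lost_transactions: float) -> float:
    """Longest gap between backups (the RPO, in hours) that keeps the
    worst-case loss within `max_lost_transactions`."""
    if transactions_per_hour <= 0:
        raise ValueError("transactions_per_hour must be positive")
    # An answer of 0.0 means "no gap allowed": replicate as data arrives.
    return max_lost_transactions / transactions_per_hour


# The example from the text: 100 transactions/hour, up to 200 may be lost.
print(backup_interval_hours(100, 200))
```

With 100 transactions per hour and a tolerance of 200 .. the result is a backup every 2 hours .. with a tolerance of zero .. the interval is zero .. which is the "store and replicate at the same time" case.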
To restore your data .. you request it .. or initiate a retrieval job .. then Glacier starts preparing your data for retrieval .. then Glacier puts a copy of it in S3 .. then you have access to your data copy from S3. Regularly, this operation takes 3 to 5 hours .. or 5 to 12 hours .. or .. recently .. 1 to 5 minutes. Of course each one has a price.
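One way to connect RTO to these three options .. pick the slowest option that still finishes inside your RTO .. assuming the slower option is the cheaper one. The sketch below uses the names AWS gives these options (Bulk, Standard, Expedited) .. and takes the upper end of each time range as the worst case. The function and table names are mine.

```python
from typing import Optional

# (AWS tier name, worst-case retrieval time in minutes), slowest first.
RETRIEVAL_OPTIONS = [
    ("Bulk", 12 * 60),      # 5 to 12 hours
    ("Standard", 5 * 60),   # 3 to 5 hours
    ("Expedited", 5),       # 1 to 5 minutes
]


def pick_retrieval_option(rto_minutes: float) -> Optional[str]:
    """Return the slowest option whose worst case fits within the RTO,
    or None if even the fastest option is too slow."""
    for name, worst_case_minutes in RETRIEVAL_OPTIONS:
        if worst_case_minutes <= rto_minutes:
            return name
    return None


print(pick_retrieval_option(24 * 60))  # a whole day to spare
print(pick_retrieval_option(1))        # one minute: nothing fits
```

When the function returns `None` .. your RTO is tighter than anything Glacier offers .. and the backup belongs somewhere with immediate access.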
Let’s think through some examples .. you have a website .. you add some content every couple of days .. you have no problem losing a week of content .. you can afford to write this content again .. so in this case your RPO is one week .. you will take a backup of your website every week .. in the event of an emergency .. you need your website to recover immediately .. your RTO is almost 0 to 1 minute .. so the best option is to store the backup on S3 .. and you should have code or a script that retrieves this backup from S3 and restores it to production. Eventually you will have many backup files .. most of them you will not use .. because you usually need only the last one .. or in rare cases .. the last two or three .. what about the rest? .. you have two options .. either delete them .. or, if old backups are also important, archive them .. send them to the vault .. to Glacier .. and you can accomplish that via S3 object lifecycle management.
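For the weekly-backup example .. a lifecycle rule could look roughly like this. It is a sketch under assumptions .. the `backups/` prefix, the rule ID, and the 21-day threshold (keep about three weekly backups in S3) are all made up for illustration .. the dictionary shape follows the S3 lifecycle configuration format, with `GLACIER` as the target storage class.

```python
def glacier_lifecycle_config(prefix: str, days_before_archive: int) -> dict:
    """Build an S3 lifecycle configuration that moves objects under
    `prefix` to the GLACIER storage class after a number of days."""
    return {
        "Rules": [
            {
                "ID": "archive-old-backups",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days_before_archive, "StorageClass": "GLACIER"}
                ],
            }
        ]
    }


# Keep roughly the last three weekly backups in S3, archive the rest.
config = glacier_lifecycle_config("backups/", 21)
print(config)
```

With boto3 .. a configuration like this would typically be applied with the S3 client's `put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=config)`.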
I’m a hobbyist photographer .. I have about 150GB of RAW images .. and counting .. I don’t do anything with them .. I’ve exported them to JPG .. and I only use the JPG versions .. the RAW images are my real assets .. I must protect them .. my computer is not a safe place .. an external drive is also not that safe .. for me .. Glacier is OK .. the RTO in my case is not important .. I can afford to wait a whole day or even a couple of days until I get my hands on the RAW images .. so I can go for Glacier and store my RAW images with the cheapest retrieval option .. which takes 5 to 12 hours to retrieve a single archive .. this scenario would be expensive if I needed to access this data on a regular basis .. but if I already have my RAW images locally .. and I just uploaded a copy of them .. as a disaster recovery solution .. it will not be expensive at all.
The good part about Glacier retrieval options .. they’re not fixed .. you store your data .. then when you request it .. or initiate a retrieval job .. you specify which retrieval option you want to use in this request .. this gives you flexibility .. according to the situation .. you can specify a different retrieval option each time for the same data.
Glacier is known for its long retrieval time .. but, recently .. Amazon added one more retrieval option .. 1 to 5 minutes .. this gives Glacier more credit. In case you can’t wait for hours .. you have the minutes option .. and as we mentioned before .. if you can’t wait even for minutes .. you should avoid Glacier.
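Here is how "a different option per request" can look in practice .. a small sketch that builds the parameters for a retrieval job. The archive ID is a placeholder and the helper name is mine .. the keys `Type`, `ArchiveId`, and `Tier` and the tier names follow the Glacier job parameter format.

```python
ALLOWED_TIERS = {"Expedited", "Standard", "Bulk"}


def retrieval_job_parameters(archive_id: str, tier: str) -> dict:
    """Build the parameters for one archive-retrieval job.
    The tier is chosen per request, not fixed when the data is stored."""
    if tier not in ALLOWED_TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    return {"Type": "archive-retrieval", "ArchiveId": archive_id, "Tier": tier}


# Same archive, two different situations (placeholder archive ID).
relaxed = retrieval_job_parameters("EXAMPLE-ARCHIVE-ID", "Bulk")
urgent = retrieval_job_parameters("EXAMPLE-ARCHIVE-ID", "Expedited")
print(relaxed)
print(urgent)
```

With boto3 .. a dictionary like this would typically go into the Glacier client's `initiate_job(vaultName=..., jobParameters=...)`.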
The cheapest retrieval option is the longest one .. it costs $0.0025 per GB .. the shortest retrieval option costs $0.03 per GB .. the regular retrieval option .. 3 to 5 hours .. costs $0.01 per GB. If your data is small .. it will not make a huge difference .. but if you are talking about terabytes .. it does make a difference .. if you have 1000GB .. the retrieval cost will be $2.50 or $10 or $30. Retrieval is not the only cost .. there is also the storage cost .. data transfer cost .. request cost ..
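The 1000GB numbers are easy to reproduce .. a small sketch using only the per-GB prices quoted in this article (they are a snapshot .. check current AWS pricing before relying on them). The dictionary keys are just the time ranges used in the text.

```python
# Retrieval price per GB in USD, as quoted in this article.
RETRIEVAL_PRICE_PER_GB = {
    "5-12 hours": 0.0025,
    "3-5 hours": 0.01,
    "1-5 minutes": 0.03,
}


def retrieval_cost(size_gb: float, option: str) -> float:
    """Retrieval cost in USD for `size_gb` using the given option."""
    return size_gb * RETRIEVAL_PRICE_PER_GB[option]


for option in RETRIEVAL_PRICE_PER_GB:
    print(option, round(retrieval_cost(1000, option), 2))
```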
Storage cost is cheap .. $0.004 per GB per month .. request cost is very good .. you only pay for upload requests .. any other requests like LISTVAULTS, GETJOBOUTPUT, DELETE are free .. you will pay $0.05 per 1000 UPLOAD requests.
Data transfer cost is the trap .. transferring data within the same region is free .. cool .. to another region .. $0.02 per GB .. mmm .. transferring data from Glacier to the internet .. the first 1GB is free .. then .. $0.09 per GB .. oh.
If you retrieve your data from Glacier to an EC2 machine in the same region .. it’s fine .. no cost at all .. but if you retrieve it from AWS to your local machine .. this could be a deal breaker if your data is huge. For your 1000GB you will pay about $90 for data transfer + $10 for the regular retrieval option = about $100.
Actually .. when it comes to cost calculation in AWS services .. you should give it your full attention. There are a lot of details. It’s better to read all these details carefully in the AWS documentation .. rather than read them from your bill.
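To see where the roughly $100 comes from .. here is a sketch that adds the retrieval cost to the internet transfer cost, using the prices quoted in this article. The function names are mine .. and the first free GB is applied per calculation, which is a simplification.

```python
def internet_transfer_cost(size_gb: float, free_gb: float = 1.0,
                           price_per_gb: float = 0.09) -> float:
    """Transfer-out cost in USD: the first `free_gb` is free,
    the rest is billed per GB."""
    return max(size_gb - free_gb, 0.0) * price_per_gb


def restore_to_local_cost(size_gb: float,
                          retrieval_price_per_gb: float = 0.01) -> float:
    """Retrieval (regular option by default) plus transfer to the internet."""
    return size_gb * retrieval_price_per_gb + internet_transfer_cost(size_gb)


print(round(internet_transfer_cost(1000), 2))  # transfer part only
print(round(restore_to_local_cost(1000), 2))   # retrieval + transfer
```

Because the first GB is free .. the exact total for 1000GB is $99.91 .. which the article rounds to $100. Retrieving to EC2 in the same region drops the transfer part to zero .. leaving only the $10 retrieval.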