S3 storage classes and data lakes
I assume most of you are aware that the data in a data lake is physically stored in S3. The reasons for that are many, and you can read more about them here: https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/amazon-s3-data-lake-storage-platform.html
Cool heads will prevail
One of the benefits is that S3 supports multiple storage classes, offering different pricing depending on how frequently your data is accessed. That is what I want to dive deeper into in this article. Storage classes map onto the concept of cold, warm, and hot data.
Cold data refers to data that is infrequently accessed, often historical data, which is kept either for future research or for compliance. Certain sectors are required to keep data for 7 or even 10 years and have strict compliance standards to adhere to. Future research could mean, for example, using historical data to train future ML algorithms.
Warm data is accessed frequently, without requiring very low latency. It's the data that gets ingested into your landing zone, cleaned and sent to your clean zone, and further transformed and prepared for data analysis through (automated) ETL jobs before being stored in your curated data zone. This data is used by other analytical engines.
Hot data can then be defined as very frequently accessed data that requires extremely low and consistent latency. There are several solutions for this type of data. Within the overall modern data architecture, I could load the data directly into my data analytics engine of choice: for example, into my data warehouse (e.g., Amazon Redshift), my big data platform (e.g., my own big data application running on EMR - Elastic MapReduce, the AWS-managed Hadoop cluster), ... We will see, however, that there is a brand-new storage class that might be just what you were looking for, allowing your data to stay in your data lake and reducing data movement.
Storage classes
There are many different storage classes available for each type of data:
Frequent access: The S3 Standard storage class is meant for frequently accessed, or warm, data. It provides immediate access to data (millisecond response times). It charges a higher price per GB stored, but no data retrieval cost.
Infrequent access: There are two storage classes supporting infrequently accessed data, namely S3 Standard-Infrequent Access and S3 One Zone-Infrequent Access. Both have a significantly lower storage cost than S3 Standard, but come with a data retrieval cost. To know whether S3 Standard or an Infrequent Access storage class is the most cost-effective, please look at the S3 pricing page: https://aws.amazon.com/s3/pricing/ As a rule of thumb, once you access the data more than once a month, you end up being more cost-effective storing it in S3 Standard. Please be aware that there is a minimum charge of 30 days for data stored in Infrequent Access; in other words, if you are not planning to keep the data in that storage class for at least 30 days, you might want to reconsider. The only difference between the two infrequent access storage classes is that One Zone-Infrequent Access stores its data in only one Availability Zone. So, check your high availability requirements before choosing.
Archival storage: Here we come to really cold data: data that is stored but that we do not expect to access for several months, maybe only once or twice a year. This is where all that compliance data comes in. There are, however, quite a few different storage classes that make up our archival storage: S3 Glacier Instant Retrieval, S3 Glacier Flexible Retrieval, and S3 Glacier Deep Archive, each trading a lower storage price against slower or more expensive retrieval. A short code sketch below shows how objects are written to the different classes.
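To make the class choice concrete, here is a minimal sketch of writing objects to different storage classes with boto3, the AWS SDK for Python. The bucket and key names are hypothetical; the StorageClass values are the identifiers S3 accepts.

import boto3

s3 = boto3.client("s3")
bucket = "my-data-lake-bucket"  # hypothetical bucket name

# Warm data: S3 Standard is the default, so no StorageClass is needed.
s3.put_object(Bucket=bucket, Key="curated/orders.parquet", Body=b"...")

# Infrequently accessed data: Standard-IA (use ONEZONE_IA instead if a
# single Availability Zone satisfies your availability requirements).
s3.put_object(
    Bucket=bucket,
    Key="clean/2022/events.json",
    Body=b"...",
    StorageClass="STANDARD_IA",
)

# Really cold compliance data: Glacier Deep Archive.
s3.put_object(
    Bucket=bucket,
    Key="archive/2015/audit-log.csv",
    Body=b"...",
    StorageClass="DEEP_ARCHIVE",
)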
There is, however, a new kid on the block: S3 Express One Zone. Announced at re:Invent 2023, Express One Zone is more than just a new storage class: it introduces a new type of S3 bucket, called the 'directory bucket'. This means that we now have two types of buckets: general purpose buckets and directory buckets.
The Express One Zone storage class is meant for workloads that are latency-sensitive and request-intensive, needing consistent single-digit millisecond access to their most frequently used data.
This helps bring down the Total Cost of Ownership (TCO), as your compute resources are not sitting idle while waiting for data to be retrieved. It also avoids having to move data into dedicated platforms (like data warehouses or EMR), keeping the data lake as the single source of truth. There are three elements involved in supporting Express One Zone:
Please make yourselves familiar with the pricing structure of this new storage class: https://aws.amazon.com/s3/pricing/
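To make the directory bucket concept concrete, here is a minimal boto3 sketch of creating one and writing hot data to it. The bucket name and Availability Zone ID are hypothetical; directory bucket names must embed the AZ ID and end with the --x-s3 suffix.

import boto3

s3 = boto3.client("s3")

# Hypothetical directory bucket, pinned to a single Availability Zone.
s3.create_bucket(
    Bucket="my-express-data--use1-az5--x-s3",
    CreateBucketConfiguration={
        "Location": {"Type": "AvailabilityZone", "Name": "use1-az5"},
        "Bucket": {"Type": "Directory", "DataRedundancy": "SingleAvailabilityZone"},
    },
)

# Objects in a directory bucket are stored in the Express One Zone class.
s3.put_object(
    Bucket="my-express-data--use1-az5--x-s3",
    Key="hot/features.parquet",
    Body=b"...",
)

Placing the bucket in the same Availability Zone as your compute is what delivers the consistent single-digit millisecond latency.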
Retrieving archived objects
Data stored in the Glacier Instant Retrieval storage class can still be considered part of your data lake. In the case of the other two archival storage classes, you will need to restore an object before you can access it. To access an object in Glacier Flexible Retrieval or Glacier Deep Archive, you must restore a temporary copy of the object to its S3 bucket for a specified duration (number of days). If you want a permanent copy of the object, restore the object and then create a copy of it in your Amazon S3 bucket. Copying restored objects isn't supported in the Amazon S3 console; for this type of copy operation, use the AWS Command Line Interface (AWS CLI), the AWS SDKs, or the REST API. Please be aware that when you restore an archived object from S3 Glacier, you pay for both the archived object and the copy that you restored temporarily. When you want to restore an object, you need to provide the following details: the number of days you want the restored copy to be available, and the retrieval tier (e.g., Standard or Bulk), which determines how fast the restore completes and what it costs.
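Here is a minimal sketch of such a restore request with boto3, again with hypothetical bucket and key names:

import boto3

s3 = boto3.client("s3")
bucket = "my-data-lake-bucket"
key = "archive/2015/audit-log.csv"

# Restore a temporary copy for 7 days using the Standard retrieval tier.
s3.restore_object(
    Bucket=bucket,
    Key=key,
    RestoreRequest={
        "Days": 7,
        "GlacierJobParameters": {"Tier": "Standard"},
    },
)

# head_object reports progress: the Restore field tells you whether the
# restore is still ongoing and, once done, when the temporary copy expires.
response = s3.head_object(Bucket=bucket, Key=key)
print(response.get("Restore"))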
Moving around
Finally, we do not have to stay within one storage class; we can move around if the frequency with which we access our data changes over time. There are basically two mechanisms: S3 Lifecycle rules, which transition objects between storage classes on a schedule you define, and S3 Intelligent-Tiering, which automatically moves objects between access tiers based on observed access patterns.
Remark: When you restore an object from S3 Intelligent-Tiering, there are no retrieval charges for Standard or Bulk retrievals. Subsequent restore requests for archived objects that have already been restored are billed as GET requests.
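As a sketch of the first mechanism, here is a hypothetical Lifecycle configuration in boto3 that lets data grow colder as it ages:

import boto3

s3 = boto3.client("s3")

# Hypothetical rule: objects under archive/ move to Standard-IA after 30
# days and on to Glacier Flexible Retrieval after 90 days, then expire
# after roughly 10 years, once the retention period has passed.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "age-out-archive-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "archive/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 3650},
            }
        ]
    },
)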
Resources for further deep dive
For more details on S3 storage classes: https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-class-intro.html
For more details on Express One Zone:
For more details on restoring objects from Glacier:
Whitepaper on Storage Best Practices on Data and Analytics Applications: https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/building-data-lake-aws.html