S3 storage classes and data lakes

I assume that most of you are aware that data in a data lake is physically stored in S3. The reasons for that are many, and you can read more about them here: https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/amazon-s3-data-lake-storage-platform.html

Cool heads will prevail

One of the benefits is that S3 supports multiple storage classes, offering different pricing depending on the frequency with which your data gets accessed. This is what I want to dive deeper into in this article. Storage classes map onto the concept of cold, warm, and hot data.

Cold data refers to data that is infrequently accessed, often historical data, which is kept either for future research or for compliance. Certain sectors are required to keep data for 7 or even 10 years and have strict compliance standards to adhere to. Future research could be, for example, using historical data to train future ML models.

Warm data is accessed frequently, without requiring very low latency. It's the data that gets ingested into your landing zone, cleaned and sent to your clean zone, and further transformed and prepared for data analysis through (automated) ETL jobs before being stored in your curated data zone. This data is used by other analytical engines.

Hot data can then be defined as very frequently accessed data that requires extremely low and consistent latency. There are several solutions for this type of data. Within the overall modern data architecture, I could load the data directly into my data analytics engine of choice: for example, into my data warehouse (e.g., Amazon Redshift) or my big data platform (e.g., my own big data application running on EMR, Elastic MapReduce, the AWS managed Hadoop cluster). We will see, however, that there is a brand-new storage class that might just be what you were looking for, allowing your data to stay in your data lake and reducing data movement.

Storage classes

There are many different storage classes available for each type of data:

Frequent access: The S3 Standard storage class is meant for frequently accessed, or warm, data. It provides immediate access to data (millisecond response times). It charges a higher price per GB stored, but no data retrieval cost.

Infrequent access: there are two storage classes supporting infrequently accessed data, namely S3 Standard-Infrequent Access and S3 One Zone-Infrequent Access. Both have a significantly lower storage cost than S3 Standard, but come with a data retrieval cost. To know whether S3 Standard or an Infrequent Access storage class is the most cost-effective, please look at S3 pricing: https://aws.amazon.com/s3/pricing/ As a rule of thumb, once you access the data more than once a month, you will be more cost-effective storing it in S3 Standard. Also be aware that there is a minimum charge of 30 days for data stored in Infrequent Access; in other words, if you are not planning to keep the data in that storage class for at least 30 days, you might want to reconsider. The only difference between the two infrequent access storage classes is that One Zone-Infrequent Access stores its data in only one Availability Zone, so check your high availability requirements before choosing.
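The rule of thumb above can be made concrete with a little arithmetic. This is a minimal sketch: the prices below are illustrative placeholders, not current AWS pricing, so plug in the real numbers for your region from the pricing page before drawing conclusions.

```python
# Break-even sketch between S3 Standard and S3 Standard-IA.
# The per-GB prices are ASSUMED example values, not actual AWS pricing.
STANDARD_STORAGE_PER_GB = 0.023   # assumed $/GB-month, S3 Standard
IA_STORAGE_PER_GB = 0.0125        # assumed $/GB-month, Standard-IA
IA_RETRIEVAL_PER_GB = 0.01        # assumed $/GB retrieved, Standard-IA

def monthly_cost_standard(gb_stored: float, gb_retrieved: float) -> float:
    # S3 Standard: you pay for storage only, no per-GB retrieval fee
    return gb_stored * STANDARD_STORAGE_PER_GB

def monthly_cost_ia(gb_stored: float, gb_retrieved: float) -> float:
    # Standard-IA: cheaper storage, but every retrieved GB is billed
    return gb_stored * IA_STORAGE_PER_GB + gb_retrieved * IA_RETRIEVAL_PER_GB

# Compare reading the full 1 TB data set 0, 1, or 2 times per month
gb = 1000.0
for reads_per_month in (0, 1, 2):
    std = monthly_cost_standard(gb, gb * reads_per_month)
    ia = monthly_cost_ia(gb, gb * reads_per_month)
    print(f"{reads_per_month} full reads/month: Standard ${std:.2f} vs IA ${ia:.2f}")
```

With these assumed prices, Infrequent Access wins at zero or one full read per month, and S3 Standard wins from two reads onward, which matches the "more than once a month" rule of thumb.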

Archival storage: Here we come to really cold data: data that is stored but that we are not expecting to access for several months, maybe only once or twice a year. This is where all that compliance data comes in. There are, however, quite a few different storage classes that make up our archival storage:

  • Glacier Instant Retrieval storage class: different from all the other archival storage classes, this one gives you instant (millisecond) access to your data. Contrary to the other archival storage classes, with Instant Retrieval you do not need to restore objects before you can access them. As a consequence, objects in this storage class can be queried directly with Athena. There is a minimum charge of 90 days for data being stored in this storage class.
  • Glacier Flexible Retrieval storage class: data here cannot be directly queried. Storage cost is very low and there are multiple retrieval options, ranging from minutes to hours. It is worth having a look at the pricing for each of these options (see again: https://aws.amazon.com/s3/pricing/). There is a minimum charge of 90 days for data being stored in this storage class.
  • Glacier Deep Archive storage class: data here cannot be directly queried. The storage cost (per GB) is extremely low, and retrieval is measured in hours (12 hours minimum). There is a minimum charge of 180 days for data being stored in this storage class. This is for data you are not expecting to access more than once or twice per year and that is stored for compliance purposes.
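Those minimum charges are easy to overlook, so here is a small sketch of how the billing rule works: delete (or transition) an object before the minimum duration and you are still billed for the remaining days. The day counts come from the classes above; the storage class names follow the S3 API naming, and the helper itself is just illustrative.

```python
# Minimum storage durations per class, as described in this section.
MIN_DURATION_DAYS = {
    "STANDARD_IA": 30,    # Standard-Infrequent Access
    "GLACIER_IR": 90,     # Glacier Instant Retrieval
    "GLACIER": 90,        # Glacier Flexible Retrieval
    "DEEP_ARCHIVE": 180,  # Glacier Deep Archive
}

def billable_days(storage_class: str, days_stored: int) -> int:
    """Days you pay for: at least the class minimum, even if deleted earlier."""
    return max(days_stored, MIN_DURATION_DAYS[storage_class])

print(billable_days("DEEP_ARCHIVE", 30))  # deleted after 30 days, billed for 180
print(billable_days("GLACIER_IR", 365))   # kept a full year, billed for 365
```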

There is however a new kid on the block: S3 Express One Zone. Announced at re:Invent 2023, Express One Zone is more than just a new storage class; it introduces a new type of S3 bucket, called the directory bucket. This means that we now have two types of buckets:

  1. General-purpose buckets: the buckets as we have known them until now, used for all the other storage classes we talked about earlier in this article.
  2. Directory buckets: more on these in a minute.

The Express One Zone storage class is meant for workloads that:

  • are request-intensive (for example, training ML models)
  • require millions of transactions per minute, independent of storage size
  • require consistent single-digit millisecond PUT and GET latencies (up to 10 times faster than the S3 Standard storage class)

This will help bring down the Total Cost of Ownership (TCO), as your compute resources are not sitting idle while waiting for data to be retrieved. It also avoids having to move data into dedicated platforms (like data warehouses or EMR), keeping with the single-source-of-truth principle of data lakes. There are three elements involved in supporting Express One Zone:

  1. Directory buckets: this is a new high-performance bucket type with a hierarchical namespace. The support for extremely high numbers of TPS (Transactions Per Second) is independent of the number of folders (or prefixes) in the bucket (as would be the case with general-purpose buckets). For more information on directory buckets: https://docs.aws.amazon.com/AmazonS3/latest/userguide/directory-buckets-overview.html
  2. Co-location of compute resources with the bucket: as you can guess from the name, the Express One Zone storage class stores data (redundantly) in one Availability Zone. To make sure we get the most out of the low-latency nature of the directory bucket, it is recommended to place your compute resources (if possible) in the same Availability Zone.
  3. Session-based authentication: this is a faster authentication model, based on setting up a session (using the CreateSession API call) and receiving temporary credentials. The session token is then included in subsequent requests.
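To make these three elements a bit more tangible, here is a sketch of what the session flow would look like with boto3. The bucket name and Availability Zone ID are hypothetical examples (directory bucket names follow the pattern base-name--zone-id--x-s3); the actual API call is commented out, since it needs credentials and an existing directory bucket, and in practice the SDK manages CreateSession for you behind the scenes.

```python
# Hypothetical directory bucket name: <base-name>--<zone-id>--x-s3
DIRECTORY_BUCKET = "my-analytics-data--use1-az4--x-s3"

# Parameters for the CreateSession API call (S3 Express One Zone)
create_session_params = {
    "Bucket": DIRECTORY_BUCKET,
    "SessionMode": "ReadWrite",  # or "ReadOnly"
}

# A real call would look like this (not executed here, needs credentials):
#   s3 = boto3.client("s3")
#   resp = s3.create_session(**create_session_params)
#   token = resp["Credentials"]["SessionToken"]
# The temporary credentials are then sent with subsequent object requests.

print(create_session_params["Bucket"].endswith("--x-s3"))
```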

Please make yourselves familiar with the pricing structure of this new storage class: https://aws.amazon.com/s3/pricing/

Retrieving archived objects

Data stored in the Glacier Instant Retrieval storage class can still be considered part of your data lake. In the case of the other two archival storage classes, you will need to restore an object before you can access it. To access an object in Glacier Flexible Retrieval or Glacier Deep Archive, you must restore a temporary copy of the object to its S3 bucket for a specified duration (number of days). If you want a permanent copy of the object, restore the object and then create a copy of it in your Amazon S3 bucket. Copying restored objects isn't supported in the Amazon S3 console; for this type of copy operation, use the AWS Command Line Interface (AWS CLI), the AWS SDKs, or the REST API. Please be aware that when you restore an archived object from S3 Glacier, you pay for both the archived object and the copy that you restored temporarily.

When you want to restore an object, you need to provide the following details:

  • go to the bucket and select the object you want to restore
  • number of days the temporary copy will be available
  • retrieval tier: for Deep Archive you can choose between Bulk and Standard retrieval (Bulk takes longer but comes at a significantly lower retrieval cost); for Flexible Retrieval you can choose between Expedited, Standard, and Bulk (from fast to slow; again, be aware of the difference in price).
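The same details map directly onto the RestoreObject API if you go the SDK route instead of the console. A minimal sketch with boto3-style parameters, where the bucket and key are hypothetical and the call itself is commented out because it needs credentials and a real archived object:

```python
# Parameters for a boto3 restore_object call; bucket and key are examples.
restore_params = {
    "Bucket": "my-compliance-archive",    # hypothetical bucket
    "Key": "2017/transactions.parquet",   # hypothetical archived object
    "RestoreRequest": {
        "Days": 7,  # how long the temporary copy stays available
        "GlacierJobParameters": {
            # Flexible Retrieval: "Expedited" | "Standard" | "Bulk"
            # Deep Archive:       "Standard" | "Bulk" (no Expedited)
            "Tier": "Bulk",
        },
    },
}

# Real call (not executed here):
#   boto3.client("s3").restore_object(**restore_params)

print(restore_params["RestoreRequest"]["Days"])
```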

Moving around

Finally, we do not have to stay within a storage class; we can move around if the frequency with which we access our data changes over time. There are basically two mechanisms:

  1. Lifecycle rules: here you can set automatic rules to move objects between storage classes, based either on the number of days an object has been in a specific storage class or on the number of versions of the object that are available (if versioning is enabled). Note though that this requires you to know how the access patterns of your data change over time (i.e., you would need predictable access patterns), and only specific transitions are allowed (generally from warmer to colder storage classes). The new Express One Zone is not included, since it is supported by an entirely different bucket type.

  2. S3 Intelligent-Tiering storage class: we have one more storage class for you. This one supports multiple access tiers internally in the storage class. The access tiers are:
     • Frequent Access tier: this is the default access tier; an object stays here as long as it is being accessed. Storage and retrieval cost corresponds with the S3 Standard storage class.
     • Infrequent Access tier: if an object is not accessed for 30 consecutive days, it moves to the Infrequent Access tier. Storage and retrieval cost corresponds with the S3 Standard-Infrequent Access storage class.
     • Archive Instant Access tier: if an object is not accessed for 90 consecutive days, it moves to the Archive Instant Access tier. This tier corresponds with the Glacier Instant Retrieval storage class.
     • Archive Access tier: this is an optional tier. After activation, the Archive Access tier automatically archives objects that have not been accessed for a minimum of 90 consecutive days. You can extend the last-access time for archiving to a maximum of 730 days. The Archive Access tier has the same performance as the S3 Glacier Flexible Retrieval storage class.
     • Deep Archive Access tier: this is an optional tier. After activation, the Deep Archive Access tier automatically archives objects that have not been accessed for a minimum of 180 consecutive days. You can extend the last-access time for archiving to a maximum of 730 days. The Deep Archive Access tier has the same performance as the S3 Glacier Deep Archive storage class.
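A lifecycle rule of the first kind can be sketched as a boto3-style configuration. The prefix, bucket, rule name, and retention period below are hypothetical examples; the rule moves objects to Standard-IA after 30 days, to Deep Archive after a year, and expires them after a 7-year (assumed) compliance retention:

```python
# Hypothetical lifecycle configuration for put_bucket_lifecycle_configuration
lifecycle_config = {
    "Rules": [
        {
            "ID": "cool-down-historical-data",   # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": "curated/"},    # hypothetical prefix
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
            # Delete after an assumed 7-year compliance retention period
            "Expiration": {"Days": 7 * 365},
        }
    ]
}

# Applied with (not executed here, needs credentials):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-data-lake", LifecycleConfiguration=lifecycle_config)
```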

Remark: When you restore an object from S3 Intelligent-Tiering there are no retrieval charges for Standard or Bulk retrievals. Subsequent restore requests called on archived objects that have already been restored are billed as a GET request.
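Since the two archive tiers of Intelligent-Tiering are opt-in, here is a sketch of what activating them would look like. The configuration ID and bucket are hypothetical, and the day counts are just the minimums mentioned above (90 and 180); both can be extended up to 730 days:

```python
# Hypothetical opt-in to the optional Intelligent-Tiering archive tiers
tiering_config = {
    "Id": "archive-opt-in",   # hypothetical configuration id
    "Status": "Enabled",
    "Tierings": [
        {"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},
        {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"},
    ],
}

# Applied with (not executed here, needs credentials):
#   boto3.client("s3").put_bucket_intelligent_tiering_configuration(
#       Bucket="my-data-lake",
#       Id=tiering_config["Id"],
#       IntelligentTieringConfiguration=tiering_config)
```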


Resources for further deep dive

For more details on S3 storage classes: https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-class-intro.html

For more details on Express One Zone:

For more details on restoring objects from Glacier:

Whitepaper on Storage Best Practices on Data and Analytics Applications: https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/building-data-lake-aws.html
