How to Use S3 Object Tags for Iceberg Tables Created by EMR Serverless to Move Expired Snapshots into Glacier or Delete Them with a Lifecycle Policy

Storage cost is a significant consideration in cloud-native data management. Amazon S3 provides a simple yet powerful feature called object tagging, which lets users categorize stored objects with key-value pairs. These tags can be used to reduce the storage cost of large datasets in Amazon S3, especially when combined with table formats like Apache Iceberg.

One of the most valuable applications of S3 object tags is pairing them with Apache Iceberg tables so that files from expired snapshots move to cheaper storage classes, such as Glacier Deep Archive, or are deleted outright. In this blog post, we explore how to configure object tagging for Iceberg tables created by EMR Serverless and how to set up lifecycle policies to archive expired snapshots, walking through a hands-on lab that implements these concepts.

What Are S3 Object Tags?

Amazon S3 Object Tags are user-defined metadata in the form of key-value pairs that can be attached to objects stored in Amazon S3. These tags help categorize and manage objects based on their characteristics or business logic. Tags can be used for:

  • Object lifecycle management (e.g., moving data to a lower-cost storage class after a certain period).
  • Implementing data retention policies.
  • Applying access controls based on tags.

Apache Iceberg and S3 Integration

Apache Iceberg is a high-performance table format designed for large analytic datasets. Iceberg tables can be stored in object stores like Amazon S3. When working with Iceberg, users can benefit from S3 object tagging to implement custom data management policies, such as archiving expired snapshots or moving data to different storage classes.

Iceberg provides support for S3 object tags through configurations that allow users to tag objects during writes and deletes. The tags can also be used in S3 Lifecycle Policies to automate storage transitions and deletions.

Key Configurations for S3 Object Tagging

To enable object tagging, we set two Iceberg S3 properties on the catalog in the Spark session. Here’s what these configurations do:

  • write-tag-name: Set through s3.write.tags.<tag-name>. Tags every object Iceberg writes to the table, so lifecycle rules can identify data created by the table.
  • delete-tag-name: Set through s3.delete.tags.<tag-name>. Tags objects that Iceberg is about to delete, which is useful for managing the deletion lifecycle.

Step 1: Configure the Spark Session

The Spark session is configured to:

  • Use AWS Glue as the catalog.
  • Store data in S3.
  • Enable object tagging for writes and deletes.
  • Disable physical deletes (s3.delete-enabled = false) so that lifecycle policies handle removal instead.
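
A minimal sketch of such a session configuration follows. The catalog name (dev), the warehouse path, and the tag keys and values (created = true, deleted = true) are assumptions chosen for illustration, not values taken from the original job; the property names follow Iceberg's S3FileIO settings.

```python
from typing import Dict, Tuple


def iceberg_tagging_conf(
    catalog: str,
    warehouse: str,
    write_tag: Tuple[str, str],
    delete_tag: Tuple[str, str],
) -> Dict[str, str]:
    """Return Spark properties for a Glue-backed Iceberg catalog with S3 tagging."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Use AWS Glue as the catalog and S3 as the storage layer.
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse,
        # Tag applied to every object Iceberg writes.
        f"{prefix}.s3.write.tags.{write_tag[0]}": write_tag[1],
        # Tag applied to objects Iceberg would otherwise delete.
        f"{prefix}.s3.delete.tags.{delete_tag[0]}": delete_tag[1],
        # Leave physical removal to the S3 lifecycle policy.
        f"{prefix}.s3.delete-enabled": "false",
    }


def build_session(conf: Dict[str, str]):
    """Create a SparkSession from the properties (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    builder = SparkSession.builder.appName("iceberg-s3-tags-demo")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Hypothetical values for illustration only.
conf = iceberg_tagging_conf(
    catalog="dev",
    warehouse="s3://my-example-bucket/warehouse/",
    write_tag=("created", "true"),
    delete_tag=("deleted", "true"),
)
```

Calling build_session(conf) inside the EMR Serverless job would then return a session with tagging enabled. The same properties could also be passed as --conf arguments when submitting the job.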

Step 2: Create Iceberg Table and Write Data

Next, we will define the schema for the data and write records to the Iceberg table.
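
As a rough illustration of this step, the sketch below creates a table and appends a few rows. The table name (dev.demo_db.customers), the three-column schema, and the sample rows are all hypothetical; the original dataset may look different.

```python
# Hypothetical schema, written as a Spark DDL string.
SCHEMA_DDL = "id INT, name STRING, city STRING"


def create_table_sql(table: str) -> str:
    """Build the CREATE TABLE statement for an Iceberg table."""
    return f"CREATE TABLE IF NOT EXISTS {table} ({SCHEMA_DDL}) USING iceberg"


def write_sample_rows(spark, table: str) -> None:
    """Create the table if needed and append a few example records."""
    rows = [
        (1, "Alice", "New York"),
        (2, "Bob", "Chicago"),
        (3, "Carol", "Seattle"),
    ]
    spark.sql(create_table_sql(table))
    df = spark.createDataFrame(rows, schema=SCHEMA_DDL)
    # DataFrameWriterV2 append: each call commits a new Iceberg snapshot.
    df.writeTo(table).append()
```

With a session in hand, write_sample_rows(spark, "dev.demo_db.customers") would write the rows. Every data and metadata file written this way should carry the write tag configured on the catalog.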

Step 3: Verify Data Write and Table History

After writing the data, we can check if the data is stored correctly in the table and view the table's history.
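
One way to run this check is through Iceberg's metadata tables, which are queried by appending a suffix such as .history or .snapshots to the table name. The helper names below are made up for this sketch.

```python
from typing import Dict


def verification_queries(table: str) -> Dict[str, str]:
    """Queries for the table contents and its snapshot lineage."""
    return {
        "rows": f"SELECT * FROM {table} LIMIT 20",
        "history": f"SELECT * FROM {table}.history",
        "snapshots": (
            f"SELECT snapshot_id, committed_at, operation FROM {table}.snapshots"
        ),
    }


def show_table_state(spark, table: str) -> None:
    """Print the rows, history, and snapshots of the table."""
    for label, query in verification_queries(table).items():
        print(f"-- {label}")
        spark.sql(query).show(truncate=False)
```

Each append or delete should show up as a separate row in the snapshots output, which makes it easy to confirm later that expiring snapshots removed the older entries.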

Step 4: Perform Delete Operations

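We will delete records from the Iceberg table. The delete commits a new snapshot, while the data files referenced by older snapshots stay in S3. Because object deletion is disabled (s3.delete-enabled = false), files that Iceberg later drops won’t be physically deleted; they are tagged for lifecycle management instead.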
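
A delete of this kind can be expressed as a plain SQL statement. The predicate and table name below are placeholders, not the ones used in the original lab.

```python
def delete_sql(table: str, predicate: str) -> str:
    """Build a row-level DELETE statement for an Iceberg table."""
    return f"DELETE FROM {table} WHERE {predicate}"


def delete_rows(spark, table: str, predicate: str) -> None:
    """Run the delete; Iceberg records it as a new snapshot."""
    spark.sql(delete_sql(table, predicate))
```

For example, delete_rows(spark, "dev.demo_db.customers", "id = 1") would remove one hypothetical record.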

Step 5: Expire Snapshots

Old snapshots need to be expired so that the files only they reference can be cleaned up. We can expire snapshots while retaining the last few, so that recent table history remains available.
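
Snapshot expiry can be sketched with Iceberg's expire_snapshots Spark procedure. The cutoff timestamp and the number of retained snapshots are assumptions to adjust for your own retention needs.

```python
from datetime import datetime


def expire_snapshots_sql(
    catalog: str, table: str, older_than: datetime, retain_last: int
) -> str:
    """Build a CALL statement for Iceberg's expire_snapshots procedure."""
    cutoff = older_than.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{cutoff}', "
        f"retain_last => {retain_last})"
    )


def expire_snapshots(spark, catalog: str, table: str, retain_last: int = 1) -> None:
    """Expire every snapshot older than now, keeping the most recent ones."""
    statement = expire_snapshots_sql(catalog, table, datetime.now(), retain_last)
    spark.sql(statement).show(truncate=False)
```

The timestamp literal is interpreted in the Spark session time zone. With s3.delete-enabled set to false, the files released by the procedure should remain in the bucket and receive the delete tag.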

Step 6: List S3 Objects and Check Tags

Now that we’ve written and deleted some data, let’s verify the tags associated with the S3 objects.
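
The check can be sketched with boto3 by listing the objects under the table prefix and reading each object's tag set. The bucket and prefix are placeholders, and the S3 client is passed in as a parameter so the helper can be exercised without AWS access.

```python
from typing import Dict, Iterator, Tuple


def iter_object_tags(
    s3_client, bucket: str, prefix: str
) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (object key, tags) for every object under the prefix."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages without matching objects have no "Contents" entry.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            response = s3_client.get_object_tagging(Bucket=bucket, Key=key)
            yield key, {tag["Key"]: tag["Value"] for tag in response["TagSet"]}


def print_object_tags(bucket: str, prefix: str) -> None:
    """Print the tags of each object (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily on purpose

    for key, tags in iter_object_tags(boto3.client("s3"), bucket, prefix):
        print(key, tags)
```

Files that belong to live snapshots should show only the write tag, while files released by snapshot expiry should also show the delete tag.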

Automate Moving Expired Snapshots to Glacier Deep Archive

Finally, you can automate moving these expired snapshots to Glacier Deep Archive by configuring S3 lifecycle policies based on the tags applied to the objects.

  1. Set up an S3 Lifecycle policy for objects with the tag deleted = true.
  2. Transition these objects to the Glacier Deep Archive storage class after a set period of time (e.g., 30 days).
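
The two steps above can be sketched as a lifecycle configuration applied with boto3. The rule ID and the optional expiration setting are assumptions added for illustration; the tag filter and the 30-day transition mirror the values described above.

```python
from typing import Any, Dict, Optional


def deep_archive_lifecycle(
    tag_key: str,
    tag_value: str,
    transition_days: int = 30,
    expire_days: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a lifecycle configuration that archives objects carrying a tag."""
    rule: Dict[str, Any] = {
        "ID": f"archive-{tag_key}-{tag_value}",  # hypothetical rule name
        "Status": "Enabled",
        # Only objects with this exact tag are affected.
        "Filter": {"Tag": {"Key": tag_key, "Value": tag_value}},
        "Transitions": [{"Days": transition_days, "StorageClass": "DEEP_ARCHIVE"}],
    }
    if expire_days is not None:
        # Optionally delete the objects once they reach this age.
        rule["Expiration"] = {"Days": expire_days}
    return {"Rules": [rule]}


def apply_lifecycle(bucket: str, configuration: Dict[str, Any]) -> None:
    """Apply the configuration (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily on purpose

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=configuration
    )


policy = deep_archive_lifecycle("deleted", "true", transition_days=30)
```

Note that put_bucket_lifecycle_configuration replaces the bucket's entire lifecycle configuration, so any existing rules need to be included in the same call. To delete the objects instead of archiving them, a rule with only an Expiration action would be the equivalent variant.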

The complete exercise files, along with the steps to submit the job to EMR Serverless and set up the lifecycle policy, can be found here:

https://github.com/soumilshah1995/emr-iceberg-tags-demo/blob/main/README.md

Conclusion

By using S3 object tags together with Apache Iceberg and EMR Serverless, you can reduce storage costs by moving the files of expired snapshots to lower-cost storage classes like Glacier Deep Archive. The hands-on lab demonstrates how to configure these tags, perform deletions, and automate lifecycle management.

References

https://github.com/aws-samples/emr-studio-notebook-examples/blob/main/examples/emr-iceberg-storage-optimizations.ipynb

https://aws.amazon.com/blogs/big-data/improve-operational-efficiencies-of-apache-iceberg-tables-built-on-amazon-s3-data-lakes/



