How to Use S3 Object Tags for Iceberg Tables Created by EMR Serverless to Move Expired Snapshots into Glacier or Delete Them with a Lifecycle Policy

Soumil S.

Sr. Software Engineer | Big Data & AWS Expert | Spark & EMR | Data Lake(Hudi | Iceberg) Specialist | YouTuber

Published: November 27, 2024

In cloud-native data management, storage cost is a significant consideration. Amazon S3 provides a simple yet powerful feature called object tagging that lets users categorize the objects they store. These tags, in the form of key-value pairs, can be used to reduce the storage cost of large datasets in Amazon S3, especially when combined with table formats like Apache Iceberg.

One of the most valuable uses of S3 object tags with Apache Iceberg tables is moving the files of expired snapshots into cheaper storage classes, such as Glacier Deep Archive, or deleting them after a set period. In this blog post, we will explore how to configure object tagging for Iceberg tables created by EMR Serverless and set up lifecycle policies to archive expired snapshots. We will walk through a hands-on lab to implement these concepts.

What Are S3 Object Tags?

Amazon S3 Object Tags are user-defined metadata in the form of key-value pairs that can be attached to objects stored in Amazon S3. These tags help categorize and manage objects based on their characteristics or business logic. Tags can be used for:

Object lifecycle management (e.g., moving data to a lower-cost storage class after a certain period).
Implementing data retention policies.
Applying access controls based on tags.
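
As a small illustration of what a tag set looks like, here is a minimal sketch that validates a dictionary of tags against the documented S3 limits (at most 10 tags per object, keys up to 128 characters, values up to 256) and converts it to the two shapes the S3 API accepts. The helper names and the example tag values are assumptions made for this illustration, not part of the original lab.

```python
from urllib.parse import urlencode

# Limits documented for S3 object tags.
MAX_TAGS_PER_OBJECT = 10
MAX_KEY_LENGTH = 128
MAX_VALUE_LENGTH = 256


def to_tag_set(tags: dict) -> list:
    """Validate tags and convert them to the TagSet shape used by the S3 API."""
    if len(tags) > MAX_TAGS_PER_OBJECT:
        raise ValueError(f"S3 allows at most {MAX_TAGS_PER_OBJECT} tags per object")
    for key, value in tags.items():
        if not key or len(key) > MAX_KEY_LENGTH:
            raise ValueError(f"invalid tag key: {key!r}")
        if len(value) > MAX_VALUE_LENGTH:
            raise ValueError(f"tag value too long for key {key!r}")
    return [{"Key": key, "Value": value} for key, value in tags.items()]


def to_tagging_string(tags: dict) -> str:
    """Encode tags as the URL query string accepted by PutObject's Tagging parameter."""
    to_tag_set(tags)  # reuse the validation above
    return urlencode(tags)


# Hypothetical tags chosen only for illustration.
example_tags = {"write-tag-name": "created", "team": "analytics"}
tag_set = to_tag_set(example_tags)
tagging = to_tagging_string(example_tags)
print(tag_set)
print(tagging)
```

The list form is what GetObjectTagging returns and what lifecycle rule filters match against; the query-string form is what a writer passes when it tags an object at upload time.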

Apache Iceberg and S3 Integration

Apache Iceberg is a high-performance table format designed for large analytic datasets. Iceberg tables can be stored in object stores like Amazon S3. When working with Iceberg, users can benefit from S3 object tagging to implement custom data management policies, such as archiving expired snapshots or moving data to different storage classes.

Iceberg provides support for S3 object tags through configurations that allow users to tag objects during writes and deletes. The tags can also be used in S3 Lifecycle Policies to automate storage transitions and deletions.

Key Configurations for S3 Object Tagging

To enable object tagging, we need to configure the following properties in the Spark session:

Here’s what these configurations do:

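write-tag-name (set through the catalog property s3.write.tags.<key>): tags every object as it is written to the Iceberg table, so that files created by the table's writers can be identified later.
delete-tag-name (set through the catalog property s3.delete.tags.<key>): tags objects at the point where Iceberg would delete them, which lets a lifecycle policy take over the actual removal or archival.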
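
A minimal sketch of how these properties could be assembled for a Spark session is shown below. The property suffixes (s3.write.tags.<key>, s3.delete.tags.<key>, s3.delete-enabled) are Iceberg S3FileIO catalog properties; the catalog name glue_catalog and the tag values created and deleted are assumptions for this example, so substitute whatever your job uses.

```python
def tagging_conf(catalog: str, write_tag: tuple, delete_tag: tuple) -> dict:
    """Build the Spark properties that make Iceberg's S3FileIO tag objects.

    write_tag and delete_tag are (key, value) pairs.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Tag applied to every object Iceberg writes.
        f"{prefix}.s3.write.tags.{write_tag[0]}": write_tag[1],
        # Tag applied to an object when Iceberg is asked to delete it.
        f"{prefix}.s3.delete.tags.{delete_tag[0]}": delete_tag[1],
        # Keep the object in S3 after tagging; a lifecycle rule handles it later.
        f"{prefix}.s3.delete-enabled": "false",
    }


# Assumed catalog name and tag values, for illustration only.
conf = tagging_conf(
    "glue_catalog",
    write_tag=("write-tag-name", "created"),
    delete_tag=("delete-tag-name", "deleted"),
)
for key, value in conf.items():
    print(f"{key}={value}")

# With PySpark available, each pair would be passed to the session builder:
#   builder = SparkSession.builder
#   for key, value in conf.items():
#       builder = builder.config(key, value)
```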

The Spark session used in this lab is configured to:

Use AWS Glue as the catalog.
Store data in S3.
Enable object tagging for writes and deletes.
Disable physical deletes (s3.delete-enabled = false) so that lifecycle policies handle removal instead.
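
The four points above could translate into a configuration like the following sketch, which also renders the properties as the --conf string that an EMR Serverless job accepts in sparkSubmitParameters. The catalog name, the bucket path, and the tag values are placeholders assumed for this example, and the Iceberg Spark runtime is assumed to be available to the job.

```python
def iceberg_glue_conf(catalog: str, warehouse: str, extra: dict = None) -> dict:
    """Spark properties for an Iceberg catalog backed by AWS Glue and S3."""
    prefix = f"spark.sql.catalog.{catalog}"
    conf = {
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse,
    }
    # Extra catalog-level properties, such as the tagging settings.
    for suffix, value in (extra or {}).items():
        conf[f"{prefix}.{suffix}"] = value
    return conf


def to_spark_submit_parameters(conf: dict) -> str:
    """Render properties as one '--conf key=value' string."""
    return " ".join(f"--conf {key}={value}" for key, value in conf.items())


# Placeholder names; replace with your own catalog, bucket, and tags.
conf = iceberg_glue_conf(
    "glue_catalog",
    "s3://my-demo-bucket/warehouse/",
    extra={
        "s3.write.tags.write-tag-name": "created",
        "s3.delete.tags.delete-tag-name": "deleted",
        "s3.delete-enabled": "false",
    },
)
params = to_spark_submit_parameters(conf)
print(params)
```

Passing the settings as submit parameters keeps the job script free of environment-specific values, which is convenient when the same script runs against different buckets.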

Step 2: Create Iceberg Table and Write Data

Next, we will define the schema for the data and write records to the Iceberg table.
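
A minimal sketch of this step, written as plain SQL strings so it can be inspected before it is run. The table name glue_catalog.demo_db.customers, its columns, and the sample rows are hypothetical stand-ins, not the data used in the original lab.

```python
def create_table_sql(table: str, columns: dict) -> str:
    """CREATE TABLE statement for an Iceberg table; columns maps name -> SQL type."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return f"CREATE TABLE IF NOT EXISTS {table} ({cols}) USING iceberg"


def insert_sql(table: str, rows: list) -> str:
    """INSERT statement for a list of row tuples holding ints and strings."""
    def literal(value):
        if isinstance(value, str):
            return "'" + value.replace("'", "''") + "'"  # escape single quotes
        return str(value)

    values = ", ".join(
        "(" + ", ".join(literal(v) for v in row) + ")" for row in rows
    )
    return f"INSERT INTO {table} VALUES {values}"


def run_all(spark, statements: list) -> None:
    """Execute each statement on an existing SparkSession."""
    for statement in statements:
        spark.sql(statement)


# Hypothetical table and rows, for illustration only.
TABLE = "glue_catalog.demo_db.customers"
statements = [
    create_table_sql(TABLE, {"id": "INT", "name": "STRING", "state": "STRING"}),
    insert_sql(TABLE, [(1, "Alice", "NY"), (2, "Bob", "CA"), (3, "O'Neil", "TX")]),
]
for statement in statements:
    print(statement)

# Inside the EMR Serverless job, with a configured session:
#   run_all(spark, statements)
```

Every data and metadata file produced by the insert should carry the write tag configured on the catalog.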

Step 3: Verify Data Write and Table History

After writing the data, we can check if the data is stored correctly in the table and view the table's history.
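
One way to run these checks is through Iceberg's metadata tables, which are queried by appending a suffix such as .history or .snapshots to the table name. The sketch below assumes the hypothetical table name glue_catalog.demo_db.customers and an existing SparkSession.

```python
def verification_queries(table: str) -> dict:
    """Queries that confirm the write and show how the table evolved."""
    return {
        "row_count": f"SELECT COUNT(*) AS row_count FROM {table}",
        "history": (
            "SELECT made_current_at, snapshot_id, is_current_ancestor "
            f"FROM {table}.history"
        ),
        "snapshots": (
            "SELECT committed_at, snapshot_id, operation "
            f"FROM {table}.snapshots ORDER BY committed_at"
        ),
    }


def show_all(spark, queries: dict) -> None:
    """Run each query on an existing SparkSession and print the result."""
    for name, query in queries.items():
        print(f"-- {name}")
        spark.sql(query).show(truncate=False)


# Hypothetical table name, for illustration only.
queries = verification_queries("glue_catalog.demo_db.customers")
for name, query in queries.items():
    print(name, "->", query)

# Inside the job:
#   show_all(spark, queries)
```

Each committed write or delete adds one row to the snapshots table, so the number of rows there is a quick way to see how many snapshots are candidates for expiration.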

Step 4: Perform Delete Operations

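We will now delete records from the Iceberg table. A delete commits a new snapshot, while the data files it replaces stay referenced by earlier snapshots. Because object deletion is disabled (s3.delete-enabled = false), those files are not physically removed when Iceberg later cleans them up; they are tagged with the delete tag and left for lifecycle management.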
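
A minimal sketch of the delete, together with a way to see which data files dropped out of the current snapshot. It assumes the hypothetical table glue_catalog.demo_db.customers and an illustrative predicate; the file paths in the example are made up to show the set arithmetic.

```python
def delete_sql(table: str, predicate: str) -> str:
    """Row-level delete; Iceberg commits it as a new snapshot."""
    return f"DELETE FROM {table} WHERE {predicate}"


def live_files_sql(table: str) -> str:
    """Data files referenced by the table's current snapshot."""
    return f"SELECT file_path FROM {table}.files"


def dropped_files(before: set, after: set) -> set:
    """Files that were live before the delete but not after it.

    These stay in S3 and remain referenced by older snapshots until
    those snapshots are expired.
    """
    return before - after


# Hypothetical table, predicate, and file paths, for illustration only.
TABLE = "glue_catalog.demo_db.customers"
statement = delete_sql(TABLE, "state = 'CA'")
print(statement)

before = {"s3://my-demo-bucket/warehouse/data/f1.parquet",
          "s3://my-demo-bucket/warehouse/data/f2.parquet"}
after = {"s3://my-demo-bucket/warehouse/data/f1.parquet",
         "s3://my-demo-bucket/warehouse/data/f3.parquet"}
print(dropped_files(before, after))

# Inside the job, the two sets would come from the metadata table:
#   before = {r.file_path for r in spark.sql(live_files_sql(TABLE)).collect()}
#   spark.sql(statement)
#   after = {r.file_path for r in spark.sql(live_files_sql(TABLE)).collect()}
```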

Step 5: Expire Snapshots

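Snapshots that are no longer needed should be expired so that the files only they reference can be released. We expire old snapshots but retain the most recent few, so that time travel and rollback to recent versions still work.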
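
Expiration can be run through Iceberg's expire_snapshots Spark procedure. The sketch below builds the CALL statement; the catalog name, table name, cutoff, and retain_last value are assumptions for illustration, and the timestamp literal assumes the Spark session time zone is UTC.

```python
from datetime import datetime, timezone


def expire_snapshots_sql(catalog: str, table: str,
                         older_than: datetime, retain_last: int) -> str:
    """CALL statement for Iceberg's expire_snapshots procedure.

    older_than must be timezone-aware; it is rendered in UTC, which assumes
    the Spark session time zone is UTC as well.
    """
    if older_than.tzinfo is None:
        raise ValueError("older_than must be timezone-aware")
    if retain_last < 1:
        raise ValueError("retain_last must be at least 1")
    cutoff = older_than.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{cutoff}', "
        f"retain_last => {retain_last})"
    )


# Hypothetical names and cutoff, for illustration only.
call = expire_snapshots_sql(
    "glue_catalog",
    "demo_db.customers",
    older_than=datetime(2024, 11, 27, 12, 0, tzinfo=timezone.utc),
    retain_last=1,
)
print(call)

# Inside the job:
#   spark.sql(call).show()
```

With s3.delete-enabled set to false, the files released by this call stay in the bucket and receive the delete tag instead of being removed.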

Step 6: List S3 Objects and Check Tags

Now that we’ve written and deleted some data, let’s verify the tags associated with the S3 objects.
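
A minimal sketch of this check, written against the boto3 S3 client calls list_objects_v2 (through a paginator) and get_object_tagging. The client is passed in as a parameter, so the functions can be tried with any object exposing those two methods; the bucket, prefix, and tag names in the usage comment are placeholders.

```python
def collect_object_tags(s3, bucket: str, prefix: str) -> dict:
    """Map every object key under the prefix to its tags as a plain dict."""
    tags_by_key = {}
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page without matching objects has no "Contents" entry.
        for obj in page.get("Contents", []):
            response = s3.get_object_tagging(Bucket=bucket, Key=obj["Key"])
            tags_by_key[obj["Key"]] = {
                tag["Key"]: tag["Value"] for tag in response["TagSet"]
            }
    return tags_by_key


def keys_with_tag(tags_by_key: dict, tag_key: str, tag_value: str) -> list:
    """Object keys carrying exactly this tag key and value."""
    return sorted(
        key for key, tags in tags_by_key.items()
        if tags.get(tag_key) == tag_value
    )


# With boto3 installed and credentials configured (placeholder names):
#   import boto3
#   s3 = boto3.client("s3")
#   tags = collect_object_tags(s3, "my-demo-bucket", "warehouse/")
#   for key in keys_with_tag(tags, "delete-tag-name", "deleted"):
#       print(key)
```

This issues one tagging request per object, which is fine for a lab-sized table but slow for large prefixes.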

Automate Moving Expired Snapshots to Glacier Deep Archive

Finally, you can automate moving these expired snapshots to Glacier Deep Archive by configuring S3 lifecycle policies based on the tags applied to the objects.

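Set up an S3 Lifecycle policy that filters on the delete tag (for example, deleted = true; use the same key and value you configured for delete tagging).
Transition the matching objects to the Glacier Deep Archive storage class after a set period (e.g., 30 days), or add an expiration action if they should be deleted instead.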
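
The two steps above can be sketched as a lifecycle configuration in the shape the S3 API expects. The rule ID format and function name are assumptions for this example; the tag pair must match whatever the Iceberg catalog was configured to apply on delete.

```python
import json


def build_lifecycle_configuration(tag_key: str, tag_value: str,
                                  transition_days: int = 30,
                                  expire_days: int = None) -> dict:
    """Lifecycle configuration that archives (and optionally deletes) tagged objects."""
    rule = {
        "ID": f"archive-{tag_key}-{tag_value}",
        "Status": "Enabled",
        # Only objects carrying this exact tag are affected.
        "Filter": {"Tag": {"Key": tag_key, "Value": tag_value}},
        "Transitions": [
            {"Days": transition_days, "StorageClass": "DEEP_ARCHIVE"}
        ],
    }
    if expire_days is not None:
        # Deleting before the transition would make the transition pointless.
        if expire_days <= transition_days:
            raise ValueError("expire_days must be greater than transition_days")
        rule["Expiration"] = {"Days": expire_days}
    return {"Rules": [rule]}


# Tag pair taken from the example in the text; adjust to your configuration.
config = build_lifecycle_configuration("deleted", "true", transition_days=30)
print(json.dumps(config, indent=2))

# With boto3 installed and credentials configured (placeholder bucket name):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-demo-bucket", LifecycleConfiguration=config)
```

Note that put_bucket_lifecycle_configuration replaces the bucket's existing lifecycle configuration, so any rules already in place need to be included in the Rules list.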

The complete exercise files, along with the steps to submit the job to EMR Serverless and set up the lifecycle policy, can be found here:

https://github.com/soumilshah1995/emr-iceberg-tags-demo/blob/main/README.md

Conclusion

By using S3 Object Tags together with Apache Iceberg and EMR Serverless, you can reduce storage costs by moving the files of expired snapshots to lower-cost storage classes like Glacier Deep Archive. The hands-on lab demonstrates how to configure these tags, perform deletions, and automate lifecycle management.

References

https://github.com/aws-samples/emr-studio-notebook-examples/blob/main/examples/emr-iceberg-storage-optimizations.ipynb

https://aws.amazon.com/blogs/big-data/improve-operational-efficiencies-of-apache-iceberg-tables-built-on-amazon-s3-data-lakes/
