Build Your Iceberg Table with Python—No Spark! | Insert, Overwrite, UPSERT & Delete | Hands-On Guide with S3 & Glue Hive Metastore Query Athena/DuckDB

Apache Iceberg is an open table format designed for big data workloads and is most often used with Apache Spark. However, you can also build and manage Iceberg tables without Spark by using PyIceberg, optionally alongside Ray. In this guide, we'll explore how to create an Iceberg table and then insert, overwrite, UPSERT, and delete data in it using PyIceberg, the AWS Glue catalog, and S3.

Why PyIceberg?

  • No need for Spark – lightweight and fast
  • Works well with Ray for distributed processing
  • Integrates seamlessly with AWS Glue and S3

Let's get started!

Setting Up AWS Glue Catalog

First, we need to set up our Iceberg catalog backed by AWS Glue.
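
A minimal sketch of what that setup can look like. The bucket path, region, and helper names here are placeholders (assumptions, not values from the original code), and the exact property keys accepted for Glue can differ between PyIceberg versions, so check the configuration docs for the release you have installed.

```python
def glue_catalog_properties(warehouse: str, region: str) -> dict:
    """Build the properties passed to PyIceberg's load_catalog for a Glue catalog."""
    return {
        "type": "glue",           # use the AWS Glue catalog implementation
        "warehouse": warehouse,   # S3 prefix where data and metadata files are written
        "glue.region": region,    # assumption: key name used by recent PyIceberg releases
        "s3.region": region,
    }


def get_catalog(warehouse: str, region: str, name: str = "glue"):
    """Return a PyIceberg catalog backed by AWS Glue."""
    # Imported here so the helper above can be used without PyIceberg installed.
    # Requires something like: pip install "pyiceberg[glue,pyarrow]"
    from pyiceberg.catalog import load_catalog

    return load_catalog(name, **glue_catalog_properties(warehouse, region))


# Hypothetical usage (needs AWS credentials in the environment):
# catalog = get_catalog("s3://my-bucket/warehouse", "us-east-1")
```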


This sets up our Iceberg catalog using AWS Glue as the metadata store and S3 as the data warehouse.

Defining Schema and Partitioning

Now, let's define our Iceberg schema and partition specification.
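
As an illustration, here is a sketch of a schema with three columns, partitioned by one of them. The column names (id, name, city) are hypothetical and chosen only for this example. Marking id as an identifier field is what later lets an upsert know which column is the key.

```python
def build_schema_and_spec():
    """Return (schema, partition_spec) for a hypothetical three-column table."""
    # Imported inside the function so this sketch loads without PyIceberg installed.
    from pyiceberg.partitioning import PartitionField, PartitionSpec
    from pyiceberg.schema import Schema
    from pyiceberg.transforms import IdentityTransform
    from pyiceberg.types import LongType, NestedField, StringType

    schema = Schema(
        NestedField(field_id=1, name="id", field_type=LongType(), required=True),
        NestedField(field_id=2, name="name", field_type=StringType(), required=False),
        NestedField(field_id=3, name="city", field_type=StringType(), required=False),
        identifier_field_ids=[1],  # "id" acts as the row key
    )

    # Partition by the raw value of "city" (source_id=3 points at that column).
    spec = PartitionSpec(
        PartitionField(
            source_id=3,
            field_id=1000,
            transform=IdentityTransform(),
            name="city",
        )
    )
    return schema, spec
```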


Creating or Loading an Iceberg Table
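
One way to get "create it if missing, otherwise load it" behaviour is sketched below. It assumes a recent PyIceberg release that provides create_namespace_if_not_exists and create_table_if_not_exists on the catalog; on older versions you would catch NoSuchTableError around load_table instead. The function name and arguments are illustrative.

```python
def get_or_create_table(catalog, namespace, table_name, schema, partition_spec):
    """Return the Iceberg table, creating the namespace and table if needed."""
    # In Glue, an Iceberg namespace corresponds to a Glue database.
    catalog.create_namespace_if_not_exists(namespace)

    # Returns the existing table when it is already registered in the catalog.
    return catalog.create_table_if_not_exists(
        identifier=f"{namespace}.{table_name}",
        schema=schema,
        partition_spec=partition_spec,
    )
```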

Insert Data
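
PyIceberg writes data from PyArrow tables, so an insert is roughly "build an Arrow table, then append it". The sketch below assumes a table with hypothetical id, name, and city columns, and the sample rows are made up for illustration. Building the Arrow table from the Iceberg schema keeps types and nullability aligned with what the table expects.

```python
# Made-up sample rows for a table with (id, name, city) columns.
SAMPLE_ROWS = [
    {"id": 1, "name": "Alice", "city": "New York"},
    {"id": 2, "name": "Bob", "city": "Paris"},
    {"id": 3, "name": "Carol", "city": "Paris"},
]


def insert_rows(table, rows):
    """Append rows (a list of dicts) to an Iceberg table; return the row count."""
    import pyarrow as pa  # imported here so the sample data loads without PyArrow

    # Convert the Iceberg schema to Arrow so the new data matches it exactly.
    arrow_table = pa.Table.from_pylist(rows, schema=table.schema().as_arrow())

    # append() adds new data files in a new snapshot; existing rows are untouched.
    table.append(arrow_table)
    return arrow_table.num_rows
```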


Overwrite Data
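
An overwrite replaces existing rows with the data you pass in. A sketch, assuming arrow_table is a PyArrow table that matches the table schema: without a filter every existing row is replaced, while the optional overwrite_filter (available in recent PyIceberg releases) limits the replacement to matching rows. The wrapper function itself is illustrative.

```python
def overwrite_table(table, arrow_table, row_filter=None):
    """Overwrite an Iceberg table, fully or only where row_filter matches."""
    if row_filter is None:
        # Replace the entire contents of the table with arrow_table.
        table.overwrite(arrow_table)
    else:
        # Delete only rows matching the filter, then write arrow_table.
        table.overwrite(arrow_table, overwrite_filter=row_filter)


# Hypothetical usage, replacing only one city's rows:
# overwrite_table(table, new_paris_rows, row_filter="city = 'Paris'")
```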



Delete Data
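
Deletes are expressed as a row filter. The sketch below assumes a numeric id column (a hypothetical name) and uses the string filter form accepted by Table.delete in recent PyIceberg releases; an expression object such as EqualTo("id", 3) from pyiceberg.expressions works as well.

```python
def delete_by_id(table, record_id):
    """Delete all rows whose id column equals record_id; return the filter used."""
    # int() guards against anything other than a number ending up in the filter.
    row_filter = f"id = {int(record_id)}"
    table.delete(delete_filter=row_filter)
    return row_filter
```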


Upsert Data
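
An upsert updates rows whose key already exists and inserts the rest. Here is a sketch using Table.upsert, which is only available in newer PyIceberg releases, so treat it as an assumption about your installed version. The key column id is hypothetical, and arrow_table is a PyArrow table matching the table schema.

```python
def upsert_rows(table, arrow_table, key_columns=("id",)):
    """Upsert arrow_table into the Iceberg table; return (updated, inserted)."""
    # Rows are matched on key_columns: matches are updated, the rest inserted.
    result = table.upsert(arrow_table, join_cols=list(key_columns))
    return result.rows_updated, result.rows_inserted
```

If join_cols is omitted, PyIceberg falls back to the identifier fields declared in the table schema.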

Running the Operations

Output



Query Athena
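
Because the table is registered in the Glue catalog, Athena can query it with plain SQL from the console. If you prefer to submit the query from Python, a sketch with boto3 could look like this. The database, table, and output location are placeholders, and the client is passed in so the function stays easy to test.

```python
def run_athena_query(athena_client, database, table_name, output_location, limit=10):
    """Start a SELECT on the Iceberg table in Athena; return the query execution id."""
    sql = f'SELECT * FROM "{database}"."{table_name}" LIMIT {int(limit)}'
    response = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes query results to this S3 location.
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Hypothetical usage (needs boto3 and AWS credentials):
# import boto3
# query_id = run_athena_query(
#     boto3.client("athena"), "my_db", "people", "s3://my-bucket/athena-results/"
# )
```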



Query DuckDB
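
One way to query the table from DuckDB without leaving Python is PyIceberg's to_duckdb helper, which loads a scan into an in-memory DuckDB table. A sketch, assuming the duckdb package is installed. The view name is arbitrary, and because the scan is materialised in memory this suits small tables best.

```python
def query_with_duckdb(table, sql, view_name="iceberg_rows"):
    """Load an Iceberg table scan into DuckDB and run sql against it."""
    # to_duckdb registers the scanned rows under view_name and returns
    # the DuckDB connection that holds them.
    con = table.scan().to_duckdb(table_name=view_name)
    return con.execute(sql).fetchall()


# Hypothetical usage:
# rows = query_with_duckdb(table, "SELECT city, COUNT(*) FROM iceberg_rows GROUP BY city")
```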


Code

https://github.com/soumilshah1995/pyiceberg-upsert-demo/blob/main/pyiceberg-glue.py


Conclusion

In this guide, we explored how to build an Iceberg table using PyIceberg without Spark. We covered:

  • Setting up the AWS Glue catalog and S3
  • Creating Iceberg tables and partitions
  • Performing INSERT, OVERWRITE, DELETE, and UPSERT operations

This hands-on approach provides an easy and efficient way to work with Iceberg tables using Python. Try it out and unlock the full potential of Iceberg + PyIceberg!
