PyIceberg Now Supports Upsert: Simplify Data Management Without Spark!


PyIceberg just got a whole lot more powerful! Version 0.9.0 introduces native upsert functionality, allowing you to merge data directly into your Iceberg tables without the need for Spark or other external processing engines. This significantly simplifies data management workflows, especially when dealing with incremental updates and changes.
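To make the term concrete, here is a conceptual sketch of what an upsert does: rows whose key already exists in the target are updated, and rows with a new key are inserted. This is plain Python for illustration only, not PyIceberg's implementation; the helper name `upsert_rows` and the sample records are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def upsert_rows(
    target: List[Row], source: List[Row], key_cols: List[str]
) -> Tuple[List[Row], int, int]:
    """Merge source into target by key; return (rows, updated, inserted)."""
    # Index the target rows by their key so matches can be found directly.
    merged = {tuple(row[c] for c in key_cols): dict(row) for row in target}
    updated = inserted = 0
    for row in source:
        key = tuple(row[c] for c in key_cols)
        if key not in merged:
            inserted += 1      # new key: insert
        elif merged[key] != row:
            updated += 1       # existing key with changed values: update
        else:
            continue           # identical row: nothing to do
        merged[key] = dict(row)
    return list(merged.values()), updated, inserted


existing = [{"id": 1, "status": "open"}, {"id": 2, "status": "open"}]
incoming = [{"id": 2, "status": "closed"}, {"id": 3, "status": "open"}]

rows, updated, inserted = upsert_rows(existing, incoming, ["id"])
print(updated, inserted)  # 1 1
```

The key columns play the same role as the join columns you pass to a table-level upsert: they decide whether an incoming row counts as an update or an insert.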


Getting Started

Before you dive into the code, make sure you have PyIceberg 0.9.0 or later installed, for example by running pip install --upgrade pyiceberg.
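If you are not sure which version is in your environment, a small check like the one below can help. It is a sketch that uses only the standard library; the helper names `parse_version`, `supports_upsert`, and `check_installed` are hypothetical, and the 0.9.0 minimum comes from the release mentioned above.

```python
import re
from importlib.metadata import PackageNotFoundError, version

MIN_VERSION = (0, 9, 0)  # first release with upsert, per this article


def parse_version(text: str) -> tuple:
    """Turn '0.9.0' or '0.9.0rc1' into a comparable tuple like (0, 9, 0)."""
    numbers = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep only the leading digits
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def supports_upsert(installed: str) -> bool:
    # Note: pre-release suffixes are ignored, so '0.9.0rc1' counts as 0.9.0.
    return parse_version(installed) >= MIN_VERSION


def check_installed(package: str = "pyiceberg") -> str:
    try:
        installed = version(package)
    except PackageNotFoundError:
        return f"{package} is not installed"
    status = "supports" if supports_upsert(installed) else "is too old for"
    return f"{package} {installed} {status} upsert"


print(check_installed())
```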

Example: Managing Site Messages

Let's walk through a practical example of how to use the new upsert functionality. Imagine we are managing site messages, where each message is associated with a site and has a unique ID. We want to be able to update existing messages and add new ones as needed.


Initial Data
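A minimal sketch of how the initial data and a follow-up upsert might look, assuming PyIceberg 0.9.0 or later with PyArrow and SQLAlchemy available (for example via the pyarrow and sql-sqlite extras). The table name `default.site_messages`, the column names `site_id`, `message_id`, and `message`, the sample rows, and the local SQLite catalog are all illustrative assumptions, not the original demo code.

```python
# Hypothetical sample rows; column names are assumptions for illustration.
INITIAL_ROWS = [
    {"site_id": "site-1", "message_id": 1, "message": "Welcome"},
    {"site_id": "site-1", "message_id": 2, "message": "Maintenance tonight"},
    {"site_id": "site-2", "message_id": 3, "message": "New store hours"},
]

# One changed row (message_id 2) and one new row (message_id 4).
INCOMING_ROWS = [
    {"site_id": "site-1", "message_id": 2, "message": "Maintenance postponed"},
    {"site_id": "site-2", "message_id": 4, "message": "Holiday closure"},
]


def run_upsert_demo(warehouse_dir: str):
    """Create a local table, load the initial rows, then upsert new ones."""
    # Imported here so the sample data can be inspected without PyIceberg.
    import pyarrow as pa
    from pyiceberg.catalog.sql import SqlCatalog

    # A local, file-backed catalog; no Spark or external service involved.
    catalog = SqlCatalog(
        "local",
        uri=f"sqlite:///{warehouse_dir}/catalog.db",
        warehouse=f"file://{warehouse_dir}",
    )
    catalog.create_namespace("default")

    initial = pa.Table.from_pylist(INITIAL_ROWS)
    table = catalog.create_table("default.site_messages", schema=initial.schema)
    table.append(initial)

    # message_id is the unique key: matches are updated, the rest inserted.
    incoming = pa.Table.from_pylist(INCOMING_ROWS)
    result = table.upsert(incoming, join_cols=["message_id"])
    print("rows updated:", result.rows_updated)
    print("rows inserted:", result.rows_inserted)

    return table.scan().to_arrow()
```

To try it, call run_upsert_demo with an empty scratch directory, such as one created by tempfile.mkdtemp(). The join columns must identify each row uniquely in the incoming data, which is why the sketch joins on message_id alone.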



Output


Benefits

  • Simplified Data Pipelines: No more reliance on Spark for basic upsert operations.
  • Increased Efficiency: Working on Iceberg tables directly from Python avoids the overhead of starting a Spark cluster, which can be faster and lighter on resources for small and medium-sized updates.
  • Greater Flexibility: Integrate upsert operations seamlessly into your existing Python workflows.

PyIceberg's new upsert functionality opens up a world of possibilities for managing your data lakes. Embrace this powerful feature and simplify your data workflows today!


Code: https://github.com/soumilshah1995/pyiceberg-upsert-demo/blob/main/README.md


References

https://github.com/apache/iceberg-python/issues/402

https://github.com/apache/iceberg-python/pull/1665

Venkata Reddy

Jawaharlal Nehru Technology University, Anathapur

1 week ago

Very informative

Pablo V.

Mathematician | Data Engineer @ Electronic Arts (EA)

1 week ago

very great news!

Antonio G.

Azure Data Engineer | Big Data Engineer | Databricks | Python | Spark | DP-203

1 week ago

Great news

Duy Nguyen

Full Digitalized Chief Operation Officer (FDO COO) | First cohort within "Coca-Cola Founders" - the 1st Corporate Venture funds in the world operated at global scale.

1 week ago

Very informative

Ananth P.

Data Engineer | Editor Data Engineering Weekly | Angel Investor| Advisor for early stage data startups| Let's chat about data engineering | Book me here calendly.com/apackkildurai

1 week ago

The implementation is good (PyIceberg implementation of upsert). The code does the partition pruning. There is an explicit ask for the partition key, instead of only the primary key. I wonder how it will figure out the partition pruning.
