PyIceberg 0.4.0
Tabular (now part of Databricks)
An independent storage platform from the original creators of Apache Iceberg.
We are happy to announce that PyIceberg 0.4.0 has been released, packed with many new features! PyIceberg is a pure Python implementation for reading Iceberg tables into your favorite engine.
This new release is a major step in the maturity of PyIceberg. You can expect much faster queries, ease-of-use improvements, and new features. With each release, PyIceberg is getting closer to implementing the full Iceberg specification.
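Enhancements include evaluation of Iceberg metrics in query planning, support for positional deletes, SQL-style filters, a limit parameter for peeking into a dataset, setting table properties, fewer calls to the object store, and a complete makeover of the docs.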
Let’s dive in!
Evaluation of Iceberg metrics
Iceberg uses metrics to speed up queries. These are now also used in PyIceberg query planning. The metrics include per-column value counts, null value counts, and lower and upper bounds for each data file.
Here’s an example: if you run a query with WHERE ts_delivered IS NULL, these metrics let Iceberg know to skip data files that don’t contain any null values for that column. This greatly reduces IO and results in a dramatic increase in query speed.
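To make the idea concrete, here is a minimal sketch of metrics-based file pruning for an IS NULL predicate. It is not PyIceberg's implementation: the `DataFileStats` class, the file paths, and `files_to_read_for_is_null` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataFileStats:
    """Hypothetical stand-in for the per-file metrics kept in table metadata."""
    path: str
    # Number of null values per column name; a column may be missing.
    null_value_counts: Dict[str, int] = field(default_factory=dict)


def files_to_read_for_is_null(files: List[DataFileStats], column: str) -> List[str]:
    """Return the paths of files that might contain a null in `column`."""
    selected = []
    for f in files:
        null_count = f.null_value_counts.get(column)
        # A missing metric proves nothing, so that file must still be read.
        if null_count is None or null_count > 0:
            selected.append(f.path)
    return selected


files = [
    DataFileStats("data/a.parquet", {"ts_delivered": 0}),   # no nulls: skipped
    DataFileStats("data/b.parquet", {"ts_delivered": 12}),  # has nulls: read
    DataFileStats("data/c.parquet"),                        # no metrics: read
]
to_read = files_to_read_for_is_null(files, "ts_delivered")
print(to_read)
```

Only the file whose metrics prove that it holds no nulls is skipped. A file without metrics is still read, because skipping it could drop matching rows.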
Support for positional deletes
Deletes are problematic in a world where the data lives remotely in an object store, and you’re using immutable file formats such as Parquet and ORC. There are multiple ways of doing deletes within Iceberg, each with pros and cons.
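There are two main strategies: copy-on-write, which rewrites the affected data files without the deleted rows, and merge-on-read, which writes separate delete files that readers apply at query time. Positional deletes belong to the second strategy.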
PyIceberg 0.4.0 now supports positional deletes. Check this video if you want to learn more about positional deletes.
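As a rough illustration of the read side: a positional delete identifies a deleted row by its data file path and its row position within that file, and a reader drops those positions while scanning. The sketch below uses plain Python lists instead of Parquet files. `apply_positional_deletes` and the sample data are hypothetical, and this is not how PyIceberg implements it internally.

```python
from typing import List, Set, Tuple


def apply_positional_deletes(
    file_path: str,
    rows: List[dict],
    deletes: List[Tuple[str, int]],
) -> List[dict]:
    """Drop the rows of one data file whose (file_path, position) is marked deleted."""
    # Keep only the delete entries that target this particular data file.
    deleted_positions: Set[int] = {pos for path, pos in deletes if path == file_path}
    # Positions are 0-based row offsets within the data file.
    return [row for pos, row in enumerate(rows) if pos not in deleted_positions]


rows = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}]
deletes = [("data/a.parquet", 1), ("data/a.parquet", 3), ("data/b.parquet", 0)]
remaining = apply_positional_deletes("data/a.parquet", rows, deletes)
print(remaining)
```

The data file itself is never modified. The deletes are merged in at read time, which keeps deletes cheap to write at the cost of extra work for every reader.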
SQL-style filters
A typical query in PyIceberg looks like this:
from pyiceberg.expressions import GreaterThanOrEqual
table.scan(
    row_filter=GreaterThanOrEqual("passengers", 5)
).to_arrow()
Now, with PyIceberg 0.4.0, you can use SQL-style filters instead:
table.scan(
    row_filter="passengers >= 5"
).to_arrow()
This way you don’t have to import the expressions and you can reuse your SQL knowledge.
Peek into a dataset
When exploring data, you sometimes just want to take a look at it without fetching the whole dataset. You can now set a limit parameter:
table.scan(
    limit=100
).to_arrow()
This only reads files until the first 100 records are fetched. This makes exploring data in Iceberg tables much more responsive.
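The early-stopping behavior can be pictured with a small sketch. It models data files as in-memory lists and stops opening further files once the limit is reached. `read_with_limit` and `tracked_files` are hypothetical helpers for illustration, not PyIceberg functions.

```python
from typing import Iterable, Iterator, List


def read_with_limit(files: Iterable[List[int]], limit: int) -> Iterator[int]:
    """Yield records file by file and stop once `limit` records were produced."""
    if limit <= 0:
        return
    produced = 0
    for records in files:
        for record in records:
            yield record
            produced += 1
            if produced >= limit:
                # Returning here means later files are never opened.
                return


files_opened: List[int] = []


def tracked_files() -> Iterator[List[int]]:
    """Lazily 'open' three fake files and remember which ones were touched."""
    for index, records in enumerate([[1, 2, 3], [4, 5, 6], [7, 8, 9]]):
        files_opened.append(index)
        yield records


result = list(read_with_limit(tracked_files(), limit=4))
print(result, files_opened)
```

Because the files are consumed lazily, the third file is never touched. That laziness is where the responsiveness comes from.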
Setting table properties
Setting table properties is finally here! This long-awaited feature is a required step for write support for PyIceberg.
from datetime import datetime

with table.transaction() as transaction:
    transaction.set_properties(last_updated=str(datetime.now()))
This is now available for the REST catalog. Interested in contributing to the PyIceberg project? We’d love to have you! It would be great to add support for the Glue, DynamoDB, and Hive catalogs.
Reduced calls to the object store
This is a great example of where open source shines and helps to improve code across projects. Historically, when you profiled the read path in PyIceberg, the application made many calls to the object store. Since the goal of the Apache Iceberg project is to reduce unnecessary IO, we took a closer look and discovered a bug in PyArrow that caused an unnecessary call to fetch the Parquet metadata. Make sure to upgrade to PyArrow >= 12.0.0 to see improved performance and reduced costs on the object store.
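If you are unsure which PyArrow version your environment uses, a quick check along these lines may help. It is a minimal sketch that relies only on the standard library. The helper names `is_at_least` and `installed_pyarrow_version` are made up for this example.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def is_at_least(version_string: str, minimum: Tuple[int, int] = (12, 0)) -> bool:
    """Compare the leading major.minor of a version string against `minimum`."""
    match = re.match(r"(\d+)(?:\.(\d+))?", version_string)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version_string!r}")
    major = int(match.group(1))
    minor = int(match.group(2) or 0)
    return (major, minor) >= minimum


def installed_pyarrow_version() -> Optional[str]:
    """Return the installed PyArrow version, or None if it is not installed."""
    try:
        return version("pyarrow")
    except PackageNotFoundError:
        return None


installed = installed_pyarrow_version()
if installed is None:
    print("PyArrow is not installed")
elif is_at_least(installed):
    print(f"PyArrow {installed} is 12.0.0 or newer")
else:
    print(f"PyArrow {installed} is older than 12.0.0, consider upgrading")
```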
Complete makeover of the docs
Last but not least, I’m very excited about the complete makeover of the docs. The docs have been updated with a clean new theme and much more content. We have also added examples and published the classes with the PyDocs.
[Screenshots: the docs before the makeover and as of 0.4.0]
This was a great effort by the community. You can find the docs at https://py.iceberg.apache.org/
Give PyIceberg a try!
The list on this page covers just the highlighted features; there are many more bugfixes and small improvements. The release is available now on pip. For details, please check the docs site. Make sure to give it a try! If you run into anything, feel free to reach out in the #python channel on the Iceberg Slack.