What’s new in Iceberg 1.1
Tabular (now part of Databricks)
An independent storage platform from the original creators of Apache Iceberg.
Author: Ryan Blue
The Apache Iceberg community just released a new version, 1.1.0. In this post, we’ll explore some of the recent highlights:
API stability
Iceberg 1.1 comes on the heels of the 1.0 release that added API stability guarantees.
Other modules also have documented guarantees – core and other modules intended for query engine integration will continue to deprecate and support APIs for at least one minor release. Spark modules, for example, can change more rapidly because their purpose is to plug capabilities into Spark’s API rather than providing an API directly.
Puffin Stats Files
Puffin is a format for storing statistics, indexes, and data sketches. In this release, Puffin files have been added to table metadata to track statistics used by cost-based optimizers.
Branching and Tagging
The 0.14 release added the metadata API to manage table branches and tags, but the features weren’t yet exposed to engines. The 1.1 release adds the ability to append or delete data in a branch, and the ability for Spark to read from a named branch or tag:
df = (
    spark.read
    .option("tag", "q4_2022")
    .table("accounting.transactions")
)
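The read above can be wrapped in a small helper so callers pick either a branch or a tag, never both. This is a minimal sketch, not part of Iceberg: `read_table_ref` is a hypothetical name, and the `"branch"` option is assumed to mirror the `"tag"` option shown in the example. The helper takes an existing `spark` session as a parameter.

```python
from typing import Any, Optional


def read_table_ref(
    spark: Any,
    table: str,
    *,
    branch: Optional[str] = None,
    tag: Optional[str] = None,
) -> Any:
    """Read `table` at a named branch or tag.

    Hypothetical helper: exactly one of `branch` or `tag` must be given.
    The "branch" option name is an assumption; "tag" comes from the post.
    """
    if (branch is None) == (tag is None):
        raise ValueError("pass exactly one of branch= or tag=")
    if branch is not None:
        option_name, ref_name = "branch", branch
    else:
        option_name, ref_name = "tag", tag
    # Same call chain as the example: set the ref option, then load the table.
    return spark.read.option(option_name, ref_name).table(table)
```

With a live session this would be called as `read_table_ref(spark, "accounting.transactions", tag="q4_2022")`. Rejecting ambiguous input up front keeps a job from silently reading the wrong snapshot.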
Table Scan Metrics
An important trick for query performance
Changelog Scans
Before 1.1, it was possible to incrementally read data appended to a table, but not data that was deleted. While this worked for fact tables, not all tables change only by inserting rows.
Iceberg 1.1 introduces a new scan type for reading tables incrementally that produces all inserted or deleted rows, along with metadata columns that signal whether the row was added or deleted and when the change happened.
Using the new scan type from Spark is as easy as reading from the changes metadata table:
df = (
    spark.read
    .option("start-snapshot-id", "5186366032052790134")
    .table("taxi.nyc_taxi_yellow.changes")
)
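To make the idea of consuming inserted and deleted rows concrete, here is a rough sketch of replaying a changelog onto a keyed copy of a table in plain Python. The metadata column names `_change_type` and `_change_ordinal` and the values `"INSERT"` and `"DELETE"` are assumptions about what the changes table returns, and `apply_changes` is a hypothetical helper. Rows are plain dicts, for example collected from a small DataFrame.

```python
from typing import Any, Dict, Iterable


def apply_changes(
    state: Dict[Any, dict],
    changes: Iterable[dict],
    key: str,
) -> Dict[Any, dict]:
    """Return a new keyed state with the changelog rows applied.

    Assumed metadata columns (not confirmed by the post):
      _change_type    -- "INSERT" or "DELETE"
      _change_ordinal -- order of the change within the scanned range
    """
    # Replay in commit order; within one ordinal apply deletes first so that
    # a delete + insert of the same key behaves like an update.
    ordered = sorted(
        changes,
        key=lambda r: (r["_change_ordinal"], r["_change_type"] != "DELETE"),
    )
    result = dict(state)  # leave the caller's state untouched
    for row in ordered:
        # Strip metadata columns (leading underscore) to get the data row.
        data = {k: v for k, v in row.items() if not k.startswith("_")}
        change_type = row["_change_type"]
        if change_type == "DELETE":
            result.pop(data[key], None)
        elif change_type == "INSERT":
            result[data[key]] = data
        else:
            raise ValueError(f"unexpected change type: {change_type!r}")
    return result
```

In practice the same logic would usually run as a merge inside the engine rather than on collected rows. The sketch only shows why both inserts and deletes, plus an ordering column, are enough to keep a downstream copy in sync.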
Spark FunctionCatalog
Iceberg’s internal partition functions are now available from Iceberg catalogs in Spark SQL. This makes it significantly easier to provide a custom sort order.
SELECT system.bucket(128, "Thanks, Kyle!") as bucket_val
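For intuition about what a bucket value is, the sketch below follows the bucket transform as the Iceberg table spec describes it for strings: a 32-bit Murmur3 hash of the UTF-8 bytes, with the sign bit masked off, modulo the bucket count. It is an illustrative pure-Python version, not Iceberg's own code; `murmur3_x86_32` and `bucket_string` are names chosen here. Any real use should be checked against the hash test vectors in the spec.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """32-bit Murmur3 (x86 variant), returned as an unsigned int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n_blocks = len(data) // 4
    # Body: mix each 4-byte little-endian block into the hash.
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i : 4 * i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32
    # Tail: the remaining 1-3 bytes, if any.
    tail = data[4 * n_blocks :]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
    # Finalization: fold in the length and avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def bucket_string(num_buckets: int, value: str) -> int:
    """Bucket a string: (hash & Integer.MAX_VALUE) % num_buckets."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    return (murmur3_x86_32(value.encode("utf-8")) & 0x7FFFFFFF) % num_buckets
```

Calling `bucket_string(128, "Thanks, Kyle!")` mirrors the SQL query above and is meant to land in the same bucket, under the assumptions stated. The result is always in the range 0 to 127.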
There’s also a function to return the current Iceberg version:
SELECT system.iceberg_version() as version