What’s new in Iceberg 1.1

Tabular (now part of Databricks)

An independent storage platform from the original creators of Apache Iceberg.

Published: December 9, 2022

Author: Ryan Blue

The Apache Iceberg community just released a new version, 1.1.0. In this post, we’ll explore some of the recent highlights:

API stability
Puffin Stats Files
Branching and Tagging
Table Scan Metrics
Changelog Scans
Spark FunctionCatalog

API stability

Iceberg 1.1 comes on the heels of the 1.0 release that added API stability guarantees. While API stability isn’t brand new, it’s still worth noting! From 1.0 onward, the community will maintain binary compatibility within major release versions for the iceberg-api module.

Other modules also have documented guarantees: core and other modules intended for query engine integration will continue to deprecate and support APIs for at least one minor release. Spark modules, for example, can change more rapidly because their purpose is to plug capabilities into Spark’s API rather than to provide an API directly.

Puffin Stats Files

Puffin is a format for storing statistics, indexes, and data sketches. In this release, Puffin files have been added to table metadata to track statistics used by cost-based optimizers, like column distinct value counts (NDVs). Puffin is used both to share these stats across query engines and to maintain the data sketches that produce those stats incrementally.
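To make "maintaining sketches incrementally" concrete, here is a minimal illustration of a mergeable distinct-count sketch. It is a simplified K-minimum-values estimator written for this post, not the Apache DataSketches theta sketch that Puffin actually stores, and the class name, hash choice, and default `k` are all assumptions.

```python
import hashlib


class KMVSketch:
    """Simplified K-minimum-values sketch (illustrative, not the Puffin blob format)."""

    def __init__(self, k=256):
        self.k = k
        self.mins = set()  # holds at most the k smallest 64-bit hashes seen so far

    @staticmethod
    def _hash(value):
        # Any well-mixed 64-bit hash works; SHA-256 is used here only for determinism.
        digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, value):
        self.mins.add(self._hash(value))
        if len(self.mins) > self.k:
            self.mins.remove(max(self.mins))

    def merge(self, other):
        # Merging two sketches is what lets stats be updated without rescanning old data.
        merged = KMVSketch(min(self.k, other.k))
        merged.mins = set(sorted(self.mins | other.mins)[: merged.k])
        return merged

    def estimate(self):
        if len(self.mins) < self.k:
            return float(len(self.mins))  # exact while fewer than k distinct values
        return (self.k - 1) * 2**64 / max(self.mins)


# Sketch the existing data once, then sketch only the newly appended rows and merge.
existing = KMVSketch()
for customer_id in range(0, 100):
    existing.add(customer_id)

appended = KMVSketch()
for customer_id in range(50, 150):
    appended.add(customer_id)

ndv = existing.merge(appended).estimate()
```

Because the merged sketch depends only on the two input sketches, an engine can store the sketch next to the table and fold in new data as it arrives, which is the role the post describes for Puffin.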

Branching and Tagging

The 0.14 release added the metadata API to manage table branches and tags, but the features weren’t yet exposed to engines. The 1.1 release adds the ability to append or delete data in a branch, and the ability for Spark to read from a named branch or tag:

df = (
    spark.read
    .option("tag", "q4_2022")
    .table("accounting.transactions")
)

Table Scan Metrics

An important trick for query performance is to make sure job planning is able to take advantage of metadata indexing. This is now much easier because 1.1 collects and logs scan metrics, including the number of manifests used for planning and the total manifests in the table. This makes it simple to spot cases where queries can be sped up by rewriting metadata.
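As a rough sketch of how those two numbers might be used, the helper below flags scans that read most of a table's manifests. The `ScanSummary` type, its field names, and the thresholds are hypothetical and chosen for illustration; they are not the names Iceberg uses in its scan reports.

```python
from dataclasses import dataclass


@dataclass
class ScanSummary:
    """Hypothetical holder for two values copied out of a logged scan report."""
    table: str
    scanned_manifests: int
    total_manifests: int


def scanned_fraction(summary):
    """Fraction of the table's manifests that planning had to read."""
    if summary.total_manifests == 0:
        return 0.0
    return summary.scanned_manifests / summary.total_manifests


def is_rewrite_candidate(summary, max_fraction=0.5, min_manifests=10):
    """Flag filtered scans that still read most manifests (thresholds are assumptions)."""
    return (
        summary.total_manifests >= min_manifests
        and scanned_fraction(summary) > max_fraction
    )


summaries = [
    ScanSummary("accounting.transactions", scanned_manifests=180, total_manifests=200),
    ScanSummary("taxi.nyc_taxi_yellow", scanned_manifests=4, total_manifests=200),
]
candidates = [s.table for s in summaries if is_rewrite_candidate(s)]
```

A selective query that still touches 180 of 200 manifests suggests the metadata is not clustered along the filter columns, which is the situation a manifest rewrite is meant to fix.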

Changelog Scans

Before 1.1, it was possible to incrementally read data appended to a table, but not data that was deleted. While this worked for fact tables, not all tables change only by inserting rows.

Iceberg 1.1 introduces a new scan type for reading tables incrementally that produces all inserted or deleted rows, along with metadata columns that signal whether the row was added or deleted and when the change happened.

Using the new scan type from Spark is as easy as reading from the changes metadata table:


df = (
    spark.read
    .option("start-snapshot-id", "5186366032052790134")
    .table("taxi.nyc_taxi_yellow.changes")
)
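Once the changelog rows are collected, a consumer typically replays them in order against its own copy of the data. The following is a minimal plain-Python sketch of that replay. It assumes the metadata columns are named `_change_type` (with values `INSERT` and `DELETE`) and `_change_ordinal`, and that rows carry a hypothetical `id` key; check the column names against your Iceberg version.

```python
def apply_changes(state, changes, key="id"):
    """Replay changelog rows onto a dict keyed by `key` and return the new dict.

    Assumes each row is a dict with `_change_type` and `_change_ordinal` columns.
    """
    result = dict(state)
    # Apply changes in commit order; within one commit, apply deletes before
    # inserts so a delete + insert of the same key behaves like an update.
    ordered = sorted(
        changes,
        key=lambda row: (row["_change_ordinal"], row["_change_type"] != "DELETE"),
    )
    for row in ordered:
        # Strip metadata columns (leading underscore) before storing the row.
        data = {name: value for name, value in row.items() if not name.startswith("_")}
        if row["_change_type"] == "DELETE":
            result.pop(data[key], None)
        elif row["_change_type"] == "INSERT":
            result[data[key]] = data
        else:
            raise ValueError(f"unexpected change type: {row['_change_type']}")
    return result


current = {1: {"id": 1, "fare": 9.0}}
changes = [
    {"id": 2, "fare": 10.0, "_change_type": "INSERT", "_change_ordinal": 0},
    {"id": 1, "fare": 9.0, "_change_type": "DELETE", "_change_ordinal": 1},
    {"id": 2, "fare": 12.5, "_change_type": "INSERT", "_change_ordinal": 2},
    {"id": 2, "fare": 10.0, "_change_type": "DELETE", "_change_ordinal": 2},
]
updated = apply_changes(current, changes)
```

In a real job the rows would come from the DataFrame read above (for example via `df.collect()` on a small result), and the target would usually be another table rather than an in-memory dict.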

Spark FunctionCatalog

Iceberg’s internal partition functions are now available from Iceberg catalogs in Spark SQL. This makes it significantly easier to provide a custom sort order in a job that writes to a partitioned table that uses the bucket or truncate functions. The functions are exposed in the system database:


SELECT system.bucket(128, "Thanks, Kyle!") as bucket_val
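For intuition about what `system.bucket` returns, the Iceberg table spec defines the bucket transform as a 32-bit Murmur3 hash (x86 variant, seed 0) of the value, masked to a non-negative integer and taken modulo the number of buckets, with strings hashed as UTF-8 bytes. The pure-Python sketch below follows that definition for string values only. The function names are my own, and this is an illustration, not the code Iceberg runs.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x, r):
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data, seed=0):
    """32-bit Murmur3 (x86 variant) over bytes, returned as an unsigned int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    nblocks = len(data) // 4

    # Body: mix each 4-byte little-endian block into the hash state.
    for i in range(nblocks):
        k = int.from_bytes(data[4 * i : 4 * i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Tail: mix in the remaining 1 to 3 bytes, if any.
    tail = data[4 * nblocks :]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k

    # Finalization: fold in the length and avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def bucket_string(num_buckets, value):
    """Bucket for a string: (hash & Integer.MAX_VALUE) % num_buckets."""
    return (murmur3_32(value.encode("utf-8")) & 0x7FFFFFFF) % num_buckets


bucket_val = bucket_string(128, "Thanks, Kyle!")
```

Other types are hashed from different byte representations (for example, integers as 8-byte little-endian longs), so this helper should not be reused for non-string columns without checking the spec.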

There’s also a function to return the current Iceberg version:


SELECT system.iceberg_version() as version
