Apache Iceberg Quickstart with PyIceberg
Dunith Danushka
Product Marketing at EDB | LinkedIn Top Voice | Writer | Data Educator
In my previous post, I shared a study plan for starting your Apache Iceberg journey. Several readers have asked me to provide a super simple quickstart project to help them begin.
So, here we go.
In this hands-on guide, we'll explore Apache Iceberg through PyIceberg, its official Python client. We'll cover three main tasks: configuring a local catalog, creating and populating an Iceberg table, and exploring the data and metadata files Iceberg writes.
The goal of this post is to help you begin your Iceberg hands-on journey without getting into the complexities of setting up object stores and query engines. We will do everything with Python while covering the essentials.
What is PyIceberg?
PyIceberg is the official Python client of the Apache Iceberg project. PyIceberg is by far the easiest way to get started with Iceberg, letting you run DML operations and queries on Iceberg tables, without depending on a query engine.
While query engines like Spark provide Python bindings, PyIceberg offers a pure Pythonic experience and a lightweight way to query Iceberg tables. This makes it perfect for data analysis with your favorite Python tools like Polars, Pandas, and DuckDB.
The easiest way to get started using PyIceberg is to install via pip:
pip install -U "pyiceberg[pyarrow]"
PyIceberg is a modular ecosystem with many extension packages. You can install only the optional packages you need, keeping the installation lightweight. We just installed pyarrow alongside PyIceberg, which enables the PyArrow filesystem backend (supporting S3, HDFS, and others).
You can also install extras based on your catalog implementation; the REST catalog implementation is built-in. You need to install s3fs, adlfs, gcsfs, or pyarrow to be able to fetch files from an object store.
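For example, a setup using an AWS Glue catalog with data in S3 might combine the corresponding extras in one install (the exact extras you need depend on your catalog and storage choices):

```shell
# Install PyIceberg with the Glue catalog and S3 filesystem extras.
pip install -U "pyiceberg[glue,s3fs]"
```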
Working with PyIceberg CLI
The PyIceberg installation includes the PyIceberg CLI, which lets you easily inspect table metadata through Apache Iceberg catalogs.
On a terminal, type the following to list the available commands.
pyiceberg --help
Usage: pyiceberg [OPTIONS] COMMAND [ARGS]...
Options:
--catalog TEXT
--verbose BOOLEAN
--output [text|json]
--ugi TEXT
--uri TEXT
--credential TEXT
--help Show this message and exit.
Commands:
create Operation to create a namespace.
describe Describe a namespace or a table.
drop Operations to drop a namespace or table.
files List all the files of the table.
list List tables or namespaces.
list-refs List all the refs in the provided table.
location Return the location of the table.
properties Properties on tables/namespaces.
rename Rename a table.
schema Get the schema of the table.
spec Return the partition spec of the table.
uuid Return the UUID of the table.
version Print pyiceberg version.
Working with a catalog
Iceberg uses a catalog as a central hub for organizing tables. This can be a traditional Hive catalog for storing Iceberg tables alongside others, a vendor solution like AWS Glue catalog, or an implementation of Iceberg's REST protocol. PyIceberg CLI is a great tool for exploring the namespaces and tables in an Iceberg catalog. It allows you to easily find what namespaces and tables exist and then dive deeper into the details.
The pyiceberg list command lists the namespaces in the configured catalog. If you run it at this point, you will probably see something like this:
pyiceberg list
URI missing, please provide using --uri, the config or environment variable PYICEBERG_CATALOG__DEFAULT__URI
That’s because we haven’t configured any catalog yet. Let’s configure our first catalog next.
There are three ways to configure PyIceberg: a .pyiceberg.yaml configuration file, environment variables, or passing configuration properties directly in code.
Let’s follow the standard approach by creating a .pyiceberg.yaml file in your home directory, which provides a standardized configuration for multiple catalogs without defining them in code. You can change the directory PyIceberg searches by setting the PYICEBERG_HOME environment variable.
To keep things simple, let’s create a local SQL catalog that will store information in a local SQLite database. We'll also configure the catalog to store data files in the local filesystem instead of an object store. This setup should not be used in production due to its limited scalability.
Let’s create a temporary location for Iceberg to store our data:
mkdir /tmp/warehouse
The SQL catalog requires a database for its backend. While we are using SQLite here, PyIceberg also supports PostgreSQL through psycopg2.
To finish defining the catalog, create the .pyiceberg.yaml file in your home directory with the following YAML configuration. The uri parameter configures the connection URL, and the warehouse defines where Iceberg stores the data and metadata.
catalog:
  sandbox:
    type: sql
    uri: sqlite:////tmp/warehouse/pyiceberg_catalog.db
    warehouse: file:///tmp/warehouse
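As an alternative to the YAML file, the same catalog could be configured through environment variables. PyIceberg reads properties of the form PYICEBERG_CATALOG__&lt;catalog name&gt;__&lt;property&gt;, which is the same pattern the error message above hinted at:

```shell
# Equivalent environment-variable configuration for the sandbox catalog.
export PYICEBERG_CATALOG__SANDBOX__TYPE=sql
export PYICEBERG_CATALOG__SANDBOX__URI=sqlite:////tmp/warehouse/pyiceberg_catalog.db
export PYICEBERG_CATALOG__SANDBOX__WAREHOUSE=file:///tmp/warehouse
```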
Use the PyIceberg CLI to verify the sandbox catalog you just created. It should list the namespaces in the catalog.
pyiceberg --catalog sandbox list
You will see an empty response as we haven’t created any namespaces yet. A namespace is a logical grouping that you can use to organize related tables together.
Create the namespace quickstart :
pyiceberg --catalog sandbox create namespace quickstart
Created namespace: quickstart
Refer to the PyIceberg documentation for more information about configuring other catalogs.
Create a table
Now that we configured our catalog, let’s create an Iceberg table and populate it with some data.
Let’s download one month of data from the famous NYC taxi dataset first.
curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet -o /tmp/yellow_tripdata_2023-01.parquet
We will write some Python code to create the table and load the taxi data file into it. You could write a script or use the Python REPL.
The primary starting point for working with the PyIceberg API is the load_catalog method, which connects to an Iceberg catalog. To create a catalog instance, pass the catalog’s name from your YAML configuration:
from pyiceberg.catalog import load_catalog
catalog = load_catalog('sandbox')
Load the downloaded Parquet file into a PyArrow table:
import pyarrow.parquet as pq
df = pq.read_table("/tmp/yellow_tripdata_2023-01.parquet")
Create a new Iceberg table in the quickstart namespace using the Arrow table's schema. Then append the data to the table:
table = catalog.create_table(
"quickstart.taxi_dataset",
schema=df.schema,
)
table.append(df)
To check how many records were written to the table, scan it into an Arrow table and take its length.
len(table.scan().to_arrow())
This should result in 3066766.
The complete Python code we've written so far would look like this:
from pyiceberg.catalog import load_catalog
import pyarrow.parquet as pq
df = pq.read_table("/tmp/yellow_tripdata_2023-01.parquet")
catalog = load_catalog('sandbox')
table = catalog.create_table(
"quickstart.taxi_dataset",
schema=df.schema,
)
table.append(df)
print(len(table.scan().to_arrow()))
If you use the list command in the CLI, our new table will appear in the quickstart namespace.
pyiceberg --catalog sandbox list quickstart
quickstart.taxi_dataset
Explore Iceberg data and metadata files
Since the catalog was configured to use the local filesystem, we can explore how Iceberg saved data and metadata files from the above operations.
find /tmp/warehouse/
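The output should look roughly like the listing below; the exact file names contain generated UUIDs, shown here as `…` placeholders. The `.metadata.json` file is the table metadata, the Avro files are the manifest list and manifests, and the Parquet file under `data/` holds the rows we appended:

```
/tmp/warehouse/pyiceberg_catalog.db
/tmp/warehouse/quickstart.db/taxi_dataset/metadata/00000-….metadata.json
/tmp/warehouse/quickstart.db/taxi_dataset/metadata/….avro
/tmp/warehouse/quickstart.db/taxi_dataset/data/00000-0-….parquet
```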
Wrap up
In this post, we explored how to get started with Apache Iceberg using PyIceberg, the official Python client. We set up a local catalog, created our first Iceberg table, and loaded some real-world NYC taxi data into it, showing how straightforward it is to begin working with Iceberg tables in a pure Python environment.
While this example used a local setup for simplicity, the same principles apply when working with production-grade catalogs and object stores. PyIceberg's flexibility allows you to easily switch between different catalog implementations and storage backends.
In the next post, I'll show you how to read an Iceberg table as a DuckDB table, which will open up more interesting querying possibilities and demonstrate Iceberg's interoperability with other data tools.