Iceberg Lakehouse on Docker Using Spark, MinIO, PyIceberg, Jupyter Notebooks, and REST Catalog

Setting up a data lakehouse environment in the cloud can be daunting and expensive for developers who are just getting started. Cloud resources like object storage, compute clusters, and metadata services can quickly rack up costs while you're still learning and experimenting with features. Moreover, the complexity of configuring multiple services to work together can be overwhelming for newcomers.

This tutorial aims to solve these challenges by providing a completely local development environment using Docker. You'll be able to explore Apache Iceberg's features and experiment with different configurations without worrying about cloud costs or complex setups. The environment includes everything you need: Spark for processing, MinIO for storage, and a REST catalog for metadata management.
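To give you a feel for what the environment looks like, here is an abridged sketch of a docker-compose.yml wiring those three services together. It is modeled on the Spark Iceberg Quickstart's compose file, but the image names, ports, and credentials shown here are assumptions — check them against the compose file in the full post before using it.

```yaml
services:
  spark-iceberg:
    image: tabulario/spark-iceberg      # Spark + Jupyter, preconfigured for Iceberg
    ports:
      - "8888:8888"                     # Jupyter Notebook UI
    depends_on:
      - rest
      - minio
  rest:
    image: apache/iceberg-rest-fixture  # lightweight REST catalog for metadata
    ports:
      - "8181:8181"
    environment:
      - CATALOG_WAREHOUSE=s3://warehouse/
      - CATALOG_S3_ENDPOINT=http://minio:9000
  minio:
    image: minio/minio                  # S3-compatible object storage
    ports:
      - "9000:9000"                     # S3 API
      - "9001:9001"                     # MinIO console
    environment:
      - MINIO_ROOT_USER=admin           # example credentials only
      - MINIO_ROOT_PASSWORD=password
    command: server /data --console-address ":9001"
```

A single `docker compose up` then brings up the whole stack, with Jupyter on port 8888 and the MinIO console on port 9001.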

This guide is particularly useful for:

  • Data engineers and architects evaluating Iceberg for their organizations
  • Developers who want to learn Iceberg's features through hands-on practice
  • Teams looking to set up a local development environment for Iceberg-based projects
  • Anyone interested in understanding how different components of a data lakehouse work together

In this tutorial, we'll follow the Spark Iceberg Quickstart Guide while taking a detailed look at preparing the data lakehouse infrastructure. Once the infrastructure is ready, we'll perform several end-to-end operations on Iceberg tables, including:

  • Creating an Iceberg database
  • Creating an Iceberg table and inserting records with SQL
  • Querying the Iceberg table with PyIceberg
  • Examining the Iceberg catalog with PyIceberg CLI
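As a taste of the querying step, here is a minimal PyIceberg sketch that connects to the REST catalog and scans a table. It assumes the Docker stack above is already running and that a table was created beforehand with Spark SQL; the catalog endpoint, MinIO credentials, and the `demo.taxis` table name are all assumptions you should adapt to your setup.

```python
from pyiceberg.catalog import load_catalog

# Connect to the REST catalog exposed by the Docker stack.
# The URI and S3 credentials below are assumptions -- match them
# to the values in your docker-compose file.
catalog = load_catalog(
    "rest",
    **{
        "uri": "http://localhost:8181",
        "s3.endpoint": "http://localhost:9000",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
    },
)

# List the namespaces (databases) registered in the catalog.
print(catalog.list_namespaces())

# Load a table (hypothetical name) and scan a few rows into pandas.
table = catalog.load_table("demo.taxis")
df = table.scan(limit=10).to_pandas()  # requires pyarrow + pandas installed
print(df.head())
```

The same catalog object also drives schema inspection and metadata queries, which is what the PyIceberg CLI does under the hood.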

Since this is a lengthy post with extensive code formatting and diagrams, I was unable to render it properly on the LinkedIn text editor. Therefore, I have posted it on my Substack to provide a better reading experience and to make it easier for you to copy the code without any issues.


You can find the post here: https://tributarydata.substack.com/p/iceberg-lakehouse-on-docker-using


I apologize for the redirection, but I wanted to ensure you have the best possible reading experience.

In the next article, I will provide an example of table branching and schema evolution. Please let me know in the comments if you need anything additional.



Marvin Lanhenke

Solutions Architect | Self-Taught | Rust | Go | Python

3w

You can do this with even less infra by using polars and delta instead of Spark. For dev purposes works like a charm

Viktor Kessler

Actionable metadata

3w

A wonderful example of an open-source, license-free, cutting-edge, and enterprise-ready data platform.

Fahad Shah

Developer Advocate at RisingWave Labs | Stream Processing, Real-time Data Analytics, Real-time AI Systems, and Industrial IoT (IIoT)

1mo

Thanks for sharing this great tutorial, Dunith Danushka! It looks really great for all to get started with Iceberg and get hands-on experience!
