Iceberg Lakehouse on Docker Using Spark, MinIO, PyIceberg, Jupyter Notebooks, and REST Catalog

Setting up a data lakehouse environment in the cloud can be daunting and expensive for developers who are just getting started. Cloud resources like object storage, compute clusters, and metadata services can quickly rack up costs while you're still learning and experimenting with features. Moreover, the complexity of configuring multiple services to work together can be overwhelming for newcomers.

This tutorial aims to solve these challenges by providing a completely local development environment using Docker. You'll be able to explore Apache Iceberg's features and experiment with different configurations without worrying about cloud costs or complex setups. The environment includes everything you need: Spark for processing, MinIO for storage, and a REST catalog for metadata management.
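To give you a feel for what the environment looks like, here is an abridged sketch of a docker-compose.yml wiring those three services together. It is modeled on the Spark Iceberg Quickstart's compose file, but the image names, ports, and credentials shown here are assumptions — check them against the compose file in the full post before using it.

```yaml
services:
  spark-iceberg:
    image: tabulario/spark-iceberg      # Spark + Jupyter, preconfigured for Iceberg
    ports:
      - "8888:8888"                     # Jupyter Notebook UI
    depends_on:
      - rest
      - minio
  rest:
    image: apache/iceberg-rest-fixture  # lightweight REST catalog for metadata
    ports:
      - "8181:8181"
    environment:
      - CATALOG_WAREHOUSE=s3://warehouse/
      - CATALOG_S3_ENDPOINT=http://minio:9000
  minio:
    image: minio/minio                  # S3-compatible object storage
    ports:
      - "9000:9000"                     # S3 API
      - "9001:9001"                     # MinIO console
    environment:
      - MINIO_ROOT_USER=admin           # example credentials only
      - MINIO_ROOT_PASSWORD=password
    command: server /data --console-address ":9001"
```

A single `docker compose up` then brings up the whole stack, with Jupyter on port 8888 and the MinIO console on port 9001.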

This guide is particularly useful for:

  • Data engineers and architects evaluating Iceberg for their organizations
  • Developers who want to learn Iceberg's features through hands-on practice
  • Teams looking to set up a local development environment for Iceberg-based projects
  • Anyone interested in understanding how different components of a data lakehouse work together

In this tutorial, we'll follow the Spark Iceberg Quickstart Guide while taking a detailed look at preparing the data lakehouse infrastructure. Once the infrastructure is ready, we'll perform several end-to-end operations on Iceberg tables, including:

  • Creating an Iceberg database
  • Creating an Iceberg table and inserting records with SQL
  • Querying the Iceberg table with PyIceberg
  • Examining the Iceberg catalog with PyIceberg CLI
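As a taste of the querying step, here is a minimal PyIceberg sketch that connects to the REST catalog and scans a table. It assumes the Docker stack above is already running and that a table was created beforehand with Spark SQL; the catalog endpoint, MinIO credentials, and the `demo.taxis` table name are all assumptions you should adapt to your setup.

```python
from pyiceberg.catalog import load_catalog

# Connect to the REST catalog exposed by the Docker stack.
# The URI and S3 credentials below are assumptions -- match them
# to the values in your docker-compose file.
catalog = load_catalog(
    "rest",
    **{
        "uri": "http://localhost:8181",
        "s3.endpoint": "http://localhost:9000",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
    },
)

# List the namespaces (databases) registered in the catalog.
print(catalog.list_namespaces())

# Load a table (hypothetical name) and scan a few rows into pandas.
table = catalog.load_table("demo.taxis")
df = table.scan(limit=10).to_pandas()  # requires pyarrow + pandas installed
print(df.head())
```

The same catalog object also drives schema inspection and metadata queries, which is what the PyIceberg CLI does under the hood.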

Since this is a lengthy post with extensive code formatting and diagrams, I was unable to render it properly on the LinkedIn text editor. Therefore, I have posted it on my Substack to provide a better reading experience and to make it easier for you to copy the code without any issues.


You can find the post here: https://tributarydata.substack.com/p/iceberg-lakehouse-on-docker-using


I apologize for the redirection, but I wanted to ensure you have the best possible reading experience.

In the next article, I will provide an example of table branching and schema evolution. Please let me know in the comments if you need anything additional.



Marvin Lanhenke

Solutions Architect | Self-Taught | Rust | Go | Python

3w

You can do this with even less infra by using polars and delta instead of Spark. For dev purposes works like a charm

Viktor Kessler

Actionable metadata

3w

A wonderful example of an open-source, license-free, cutting-edge, and enterprise-ready data platform.

Fahad Shah

Developer Advocate at RisingWave Labs | Stream Processing, Real-time Data Analytics, Real-time AI Systems, and Industrial IoT (IIoT)

1mo

Thanks for sharing this great tutorial, Dunith Danushka! It looks really great for all to get started with Iceberg and get hands-on experience!
