Query String Nested JSON Data in New S3 Table Buckets (Iceberg) with DuckDB via IRCC

Soumil S.

Sr. Software Engineer | Big Data & AWS Expert | Spark & AWS Glue| Data Lake(Hudi | Iceberg) Specialist | YouTuber

Published: March 13, 2025

In the rapidly evolving data landscape, the ability to efficiently store and query complex JSON data has become increasingly important. This blog post explores a powerful combination: AWS S3 Table Buckets + Apache Iceberg + Nested JSON + DuckDB - a solution that delivers blazing-fast in-memory analytics.

The Perfect Combo: Why It Works

AWS S3 Table Buckets provide a fully managed Iceberg solution on AWS, giving you an optimized storage layer for your data. While variant type support is still developing, you can already store nested JSON as strings and query them efficiently with DuckDB, an in-memory analytical database that excels at processing complex data.

This approach gives you:

Scalability of S3 storage
Performance of DuckDB in-memory analytics
Flexibility of JSON for nested structures
Versioning and time travel capabilities of Apache Iceberg

Creating an Iceberg Table with Nested JSON

Let's first create an Iceberg table that stores nested JSON data as strings. The following Python script uses PyIceberg to create a table and insert sample customer data with nested contact information:

The key insight here is that we're storing complex JSON structures as strings in the contact_info column. This keeps the schema flexible while native variant support in Iceberg is still under development.
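The original script is not reproduced here, so the following is only a minimal sketch of the idea, not the author's exact code. The columns customer_id and name, the sample record, and the catalog name "s3tables" are assumptions made for illustration. The catalog properties (REST URI, warehouse, SigV4 settings) are left to the caller because they depend on your account and endpoint.

```python
import json


def build_rows(customers):
    """Serialize each customer's nested contact info into a JSON string column."""
    return [
        {
            "customer_id": c["customer_id"],  # assumed column
            "name": c["name"],                # assumed column
            # Nested dict -> compact JSON text, stored as a plain string.
            "contact_info": json.dumps(c["contact_info"], separators=(",", ":")),
        }
        for c in customers
    ]


def write_to_iceberg(rows, catalog_properties, namespace, table_name):
    """Create the table if needed and append rows (needs pyiceberg and pyarrow)."""
    # Imported lazily so build_rows() works without these packages installed.
    import pyarrow as pa
    from pyiceberg.catalog import load_catalog

    arrow_schema = pa.schema(
        [
            pa.field("customer_id", pa.int64()),
            pa.field("name", pa.string()),
            pa.field("contact_info", pa.string()),  # JSON kept as text
        ]
    )
    arrow_table = pa.Table.from_pylist(rows, schema=arrow_schema)

    # catalog_properties is supplied by the caller; nothing is hardcoded here.
    catalog = load_catalog("s3tables", **catalog_properties)
    catalog.create_namespace_if_not_exists(namespace)
    table = catalog.create_table_if_not_exists(
        f"{namespace}.{table_name}", schema=arrow_schema
    )
    table.append(arrow_table)
    return table


# Hypothetical sample record with nested contact information.
sample_customers = [
    {
        "customer_id": 1,
        "name": "Jane Doe",
        "contact_info": {
            "email": "jane@example.com",
            "phone": "555-0100",
            "address": {"street": "1 Main St", "city": "Springfield"},
        },
    },
]

rows = build_rows(sample_customers)
print(rows[0]["contact_info"])
```

Calling write_to_iceberg(rows, catalog_properties, "demo_ns", "customers") would then create the table and append the rows, where the namespace and table name are again placeholders.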

Querying Nested JSON with DuckDB

Now for the exciting part: querying this data with DuckDB. DuckDB's in-memory processing and built-in JSON extraction functions make this fast and convenient.
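The original query is not reproduced here, so the following is a rough sketch of what such a session might look like from Python. The ATTACH options (TYPE iceberg, ENDPOINT_TYPE s3_tables) follow DuckDB's preview support for S3 Tables and should be checked against the extension version you have installed. The alias s3_tables_db, the table bucket ARN, the table name, and the customer_id and name columns are placeholders.

```python
def setup_statements(table_bucket_arn, alias="s3_tables_db", endpoint_type="s3_tables"):
    """Return the SQL statements that load extensions and attach the catalog."""
    return [
        "INSTALL aws",
        "LOAD aws",
        "INSTALL httpfs",
        "LOAD httpfs",
        "INSTALL iceberg",
        "LOAD iceberg",
        # Pick up AWS credentials from the default provider chain.
        "CREATE SECRET (TYPE s3, PROVIDER credential_chain)",
        # Assumed option names; verify against your DuckDB Iceberg extension.
        f"ATTACH '{table_bucket_arn}' AS {alias} "
        f"(TYPE iceberg, ENDPOINT_TYPE {endpoint_type})",
    ]


def contact_query(table):
    """Build a query that pulls individual fields out of the JSON string column."""
    return f"""
SELECT
    customer_id,
    name,
    json_extract(contact_info, '$.email') AS email,
    json_extract(contact_info, '$.phone') AS phone,
    json_extract(contact_info, '$.address.street') AS street,
    json_extract(contact_info, '$.address.city') AS city
FROM {table}
""".strip()


def run(table_bucket_arn, table):
    """Execute setup and the extraction query (needs duckdb, network, AWS access)."""
    import duckdb  # imported lazily; only needed when actually querying

    con = duckdb.connect()  # in-memory database
    for stmt in setup_statements(table_bucket_arn):
        con.execute(stmt)
    return con.execute(contact_query(table)).fetchall()


# Placeholder table name: <alias>.<namespace>.<table>
print(contact_query("s3_tables_db.demo_ns.customers"))
```

Keeping the SQL in small builder functions makes it easy to inspect the statements before running them against a real table bucket.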

Understanding the Query

Let's break down what's happening in the query:

Setup and Authentication: We install and load all necessary extensions, set up AWS credentials, and connect to our Iceberg catalog.
Basic Query: First, we query all data to verify the table structure.
JSON Extraction: The magic happens with json_extract() functions that let us:
Extract top-level fields like email and phone
Navigate nested structures like address.street and address.city

This approach gives you full SQL query capabilities over nested JSON data without waiting for variant support in Iceberg.
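To make the path syntax concrete, here is a small standard-library sketch of what paths such as $.email and $.address.city resolve to for one record. It is not DuckDB's implementation, only an emulation of simple dotted paths; the sample record is hypothetical. It also shows a detail worth knowing: json_extract returns a JSON value, so strings keep their quotes, while json_extract_string returns plain text.

```python
import json


def extract(json_text, path):
    """Resolve a simple '$.a.b' path against JSON text; None if a key is missing."""
    value = json.loads(json_text)
    keys = path[2:].split(".") if path.startswith("$.") else []
    for key in keys:
        if not isinstance(value, dict) or key not in value:
            return None  # analogous to SQL NULL for a missing path
        value = value[key]
    return value


# Hypothetical contact_info value as it would be stored in the string column.
contact_info = (
    '{"email":"jane@example.com","phone":"555-0100",'
    '"address":{"street":"1 Main St","city":"Springfield"}}'
)

city = extract(contact_info, "$.address.city")
print(json.dumps(city))  # JSON-typed result keeps the quotes, like json_extract
print(city)              # plain text, like json_extract_string
print(extract(contact_info, "$.address.zip"))  # missing key -> None
```

If you need unquoted strings in query results, json_extract_string (or the ->> operator) is usually the more convenient choice in DuckDB.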

Performance Benefits

This solution offers several performance advantages:

In-memory Processing: DuckDB loads data into memory for blazing-fast analysis
Columnar Storage: Iceberg tables typically store data in columnar Parquet files, which enables efficient data access
Selective Querying: Only extract the JSON fields you need
Parallelization: DuckDB can parallelize JSON extraction across multiple cores

Looking Ahead

While storing JSON as strings is a powerful approach today, keep an eye on upcoming Iceberg features:

Native variant support, which should make querying even faster (tracked in an upstream ticket)
Nested column pruning improvements
JSON-specific optimizations in the Iceberg format

Conclusion

The combination of AWS S3 Table Buckets, Apache Iceberg, and DuckDB provides a flexible and high-performance solution for working with nested JSON data. This approach bridges the gap while waiting for full variant support and offers immediate benefits for organizations with complex data structures.

By leveraging the techniques described in this blog post, you can achieve both the flexibility of JSON and the analytical power of a modern data lake architecture.


Repo https://github.com/soumilshah1995/s3-iceberg-json-duckdb-
