Fast and Cost-Effective Querying with DuckDB on AWS Lambda (Docker Container): Scaling Queries on Parquet and Table Formats (Hudi | Iceberg | Delta)
In today's fast-paced data-driven world, organizations are constantly looking for scalable and cost-effective solutions to handle large datasets. AWS Lambda, combined with DuckDB in a Docker container, offers an ideal architecture for querying massive datasets stored in formats like Parquet, Hudi, Iceberg, and Delta. This combination enables fast, on-the-fly querying of data with minimal overhead, leveraging serverless architecture for extreme scalability.
In this blog, I will guide you step-by-step on how to set up and use DuckDB within an AWS Lambda Docker container, enabling you to run efficient, in-memory SQL queries against large datasets—whether you're dealing with millions of rows or even billions. By the end, you'll be able to query your datasets without the high costs and limitations typically associated with traditional ETL pipelines or big data services.
Solution Overview
Why DuckDB on AWS Lambda?
AWS Lambda provides a serverless architecture where you only pay for the compute time you use. This model is ideal for dynamic, variable workloads that do not require a constantly running server. DuckDB, an in-process analytical database designed for fast queries, makes the combination even more powerful. Here are a few reasons why:

- Pay-per-use: you are billed only for the compute time each query actually consumes, with no idle servers.
- Fast analytics: DuckDB runs in-process with a vectorized, columnar engine, so there is no separate database server to stand up.
- Automatic scaling: concurrent invocations each get their own isolated execution environment.
- Minimal operations: no clusters to provision, patch, or maintain, unlike traditional big data services.
What You'll Need

- An AWS account with access to Lambda, ECR, and S3
- Docker installed locally for building the container image
- The AWS CLI configured with credentials
- Python 3.x and the duckdb Python package
- Sample datasets in Parquet, Hudi, Iceberg, and Delta formats (created in the next section)
Hands-on Labs
Let's set up Iceberg, Hudi, Delta, and Parquet files and upload them to S3 for the test use cases.
We will upload the generated tables to S3 with upload_data.sh, sketched below.
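A minimal sketch of what upload_data.sh might look like, assuming the sample tables were written to local folders; the bucket name is a placeholder:

```bash
#!/bin/bash
# upload_data.sh -- push locally generated sample tables to S3.
# BUCKET is a placeholder; replace it with your own bucket.
BUCKET=s3://my-duckdb-lab-bucket

aws s3 cp sample.parquet "$BUCKET/raw/sample.parquet"
aws s3 sync ./iceberg_table "$BUCKET/iceberg/"   # Iceberg metadata + data files
aws s3 sync ./hudi_table    "$BUCKET/hudi/"      # Hudi table folder (.hoodie + parquet)
aws s3 sync ./delta_table   "$BUCKET/delta/"     # Delta log + parquet files
```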
Setting Up DuckDB in AWS Lambda Docker Container
In the Dockerfile, we use a base Python image for Lambda, install dependencies from requirements.txt, and set the Lambda function's entry point with the CMD directive.
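A minimal Dockerfile along those lines, assuming the handler lives in lambda_function.py and is named lambda_handler:

```dockerfile
# AWS-provided Python base image for Lambda (bundles the runtime interface client)
FROM public.ecr.aws/lambda/python:3.11

# Install DuckDB and any other dependencies into the task root
COPY requirements.txt .
RUN pip install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"

# Copy the function code
COPY lambda_function.py ${LAMBDA_TASK_ROOT}

# Set the Lambda entry point: <module>.<handler>
CMD ["lambda_function.lambda_handler"]
```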
Lambda Function Code
Here’s an example of a Lambda function that runs DuckDB queries. The function can be triggered by an API call, executing SQL against datasets stored in S3 (e.g., Parquet, Hudi, Iceberg, or Delta format).
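Below is a minimal sketch of such a handler. The event shape (query and format fields), the region, and the secret syntax are assumptions rather than the only way to wire this up:

```python
import json

import duckdb


def lambda_handler(event, context):
    """Run a DuckDB SQL query against data stored in S3."""
    query = event["query"]                        # SQL string from the caller
    data_format = event.get("format", "parquet")  # parquet | hudi | iceberg | delta

    # In-memory database; a /tmp-backed file is an alternative if the
    # working set is larger than the function's memory.
    con = duckdb.connect(database=":memory:")

    # httpfs lets DuckDB read s3:// paths; the credential chain picks up
    # the Lambda execution role automatically (DuckDB 1.x secret syntax).
    con.execute("INSTALL httpfs; LOAD httpfs;")
    con.execute("CREATE SECRET (TYPE S3, PROVIDER CREDENTIAL_CHAIN);")

    # Iceberg and Delta need their own extensions; Parquet and Hudi
    # copy-on-write tables use the built-in Parquet reader.
    if data_format == "iceberg":
        con.execute("INSTALL iceberg; LOAD iceberg;")
    elif data_format == "delta":
        con.execute("INSTALL delta; LOAD delta;")

    rows = con.execute(query).fetchall()
    return {"statusCode": 200, "body": json.dumps(rows, default=str)}
```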
This function accepts a SQL query from the event and, depending on the data format (Parquet, Hudi, Iceberg, or Delta), loads the appropriate extension and executes the query with DuckDB.
Handling Different Data Formats (Parquet, Hudi, Iceberg, Delta)
The real power of DuckDB comes from its ability to work seamlessly with formats like Parquet, Hudi, Iceberg, and Delta. Parquet files and Hudi copy-on-write tables are handled by DuckDB's built-in Parquet reader, while Iceberg and Delta each have a dedicated DuckDB extension.
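A minimal sketch of each access path; the s3:// bucket and table names are placeholders, and httpfs provides the S3 access:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # s3:// support

# Parquet: built-in reader, with glob support for partitioned layouts.
con.sql("SELECT count(*) FROM read_parquet('s3://my-bucket/raw/*.parquet')").show()

# Iceberg: dedicated extension, pointed at the table's root folder.
con.execute("INSTALL iceberg; LOAD iceberg;")
con.sql("SELECT * FROM iceberg_scan('s3://my-bucket/iceberg/my_table') LIMIT 10").show()

# Delta: dedicated extension.
con.execute("INSTALL delta; LOAD delta;")
con.sql("SELECT * FROM delta_scan('s3://my-bucket/delta/my_table') LIMIT 10").show()

# Hudi (copy-on-write): the data files are plain Parquet, so a recursive
# glob over the table folder works; matching only *.parquet skips the
# .hoodie metadata directory.
con.sql("SELECT * FROM read_parquet('s3://my-bucket/hudi/my_table/**/*.parquet') LIMIT 10").show()
```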
Testing Locally with Docker
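Assuming the image is tagged duckdb-lambda, you can build and run it locally. The AWS Lambda base images bundle the Runtime Interface Emulator, which exposes an HTTP endpoint for test invocations:

```bash
# Build the image and run it locally on port 9000.
docker build -t duckdb-lambda .
docker run -p 9000:8080 duckdb-lambda

# In a second terminal, invoke the handler through the
# Runtime Interface Emulator's endpoint.
curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" \
  -d '{"format": "parquet", "query": "SELECT 42 AS answer"}'
```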
Test for Iceberg
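A sample invocation against the local emulator; the payload shape follows the handler sketch above, and the S3 path is a placeholder. The Parquet and Delta tests below follow the same pattern, swapping in read_parquet and delta_scan:

```bash
curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" \
  -d '{"format": "iceberg", "query": "SELECT count(*) FROM iceberg_scan('\''s3://my-bucket/iceberg/my_table'\'')"}'
```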
Output
Test for Parquet
Output
Test for Delta
Output
Unit Test File for Hudi | Iceberg | Delta | Parquet: test_duckdb_lambda.py
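A sketch of what test_duckdb_lambda.py could contain, shown here for the Parquet path only. It assumes the handler module is named lambda_function.py and that the httpfs extension can be downloaded on first run; the Hudi, Iceberg, and Delta cases would follow the same pattern with their own fixtures:

```python
import duckdb
import pytest

# Assumes the handler module from earlier is named lambda_function.py.
from lambda_function import lambda_handler


@pytest.fixture
def parquet_file(tmp_path):
    """Write a tiny Parquet file for the test to query."""
    path = str(tmp_path / "sample.parquet")
    rel = duckdb.sql("SELECT 1 AS id, 'a' AS val UNION ALL SELECT 2, 'b'")
    rel.write_parquet(path)
    return path


def test_parquet_query(parquet_file):
    event = {
        "format": "parquet",
        "query": f"SELECT count(*) AS n FROM read_parquet('{parquet_file}')",
    }
    response = lambda_handler(event, context=None)
    assert response["statusCode"] == 200
    assert "2" in response["body"]
```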
Test Results
Infra Code
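The exact infrastructure code isn't reproduced here; as one option, a minimal AWS CDK (Python) stack for a container-image Lambda could look like the following (the stack and function names, memory size, and timeout are assumptions):

```python
from aws_cdk import App, Duration, Stack
from aws_cdk import aws_lambda as _lambda
from constructs import Construct


class DuckDbLambdaStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Build the Docker image from the local Dockerfile and deploy it
        # as a Lambda function with generous memory for DuckDB.
        _lambda.DockerImageFunction(
            self, "DuckDbQueryFn",
            code=_lambda.DockerImageCode.from_image_asset("."),
            memory_size=10240,           # Lambda's current maximum
            timeout=Duration.minutes(5),
        )


app = App()
DuckDbLambdaStack(app, "DuckDbLambdaStack")
app.synth()
```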
Deploy the stack with your deployment tool of choice; for the CDK sketch above, that is `cdk deploy`.
Why This is a Great Solution
With this approach, you can leverage up to 10 GB of memory (Lambda's maximum) and efficiently query large datasets stored in S3 at minimal cost. DuckDB's fast in-memory processing means you can query datasets with billions of rows without worrying about high costs or performance degradation.
Using AWS Lambda's serverless architecture, you only pay for the actual compute time your function uses. This allows you to build scalable, cost-effective data query solutions without the overhead of provisioning and maintaining infrastructure.
Code
Conclusion
By combining DuckDB with AWS Lambda in a Docker container, you unlock an incredibly cost-effective and scalable querying solution for your data. Whether you're working with Parquet, Hudi, Iceberg, or Delta formats, this architecture offers unparalleled flexibility and performance, all while reducing the complexity and cost of traditional data processing solutions.
This hands-on approach demonstrates how easy it is to set up and scale your data queries in the cloud, helping you unlock the full potential of your data with minimal effort.
Happy querying!
Comments

Data and Cloud Engineer · 3 months ago
I think this lacks an actual cost analysis. AWS Lambda is typically 10x more expensive than EC2 for the same compute, so this solution might actually be VERY expensive for large datasets.
CEO and Founder @ Architecture & Performance | Performance Tuning, Business Intelligence · 3 months ago
You might use a DuckDB file instead of in-memory only (/tmp/dummy.duckdb instead of :memory:). This avoids problems with memory size (and can even reduce cost by allowing a smaller Lambda); DuckDB is NOT limited to memory size if it has a file as storage. If your plan is just to convert data (CSV to Parquet, or to Delta/Iceberg/Hudi) without memory-intensive functions, DuckDB will stream and use minimal memory (depending on row group size).
Lead Data/Cloud Solutions Architect | Lead Data Engineer (AWS & Azure) | MUREX MLC | Collateral | MXML | Kafka/MSK | Spark | Snowflake | Flink | Informatica | Glue | Redshift | Mongo | Dynamo DB · 3 months ago
Insightful, very interesting. I have a problem I'd like to solve using this: one parent bucket has 10 sub-folders for 10 tables, and under each table folder the data is partitioned by date. Let's assume data arrived in one date folder for one table, and I need to pick up the related data from the other tables' date folders. Is there a way to do this without traversing all folders and partitions in the other table folders?
University of Calgary | Google Summer of Code '24 | Samsung Research · 3 months ago
Regarding "DuckDB's fast in-memory processing means you can query datasets with billions of rows without worrying about high costs or performance degradation": I don't think DuckDB in Lambda can handle analytics on billions of rows stored in lakehouses. Have you tried running any tests at this scale?
Software Engineer at BMC Software | Backend Engineering (data oriented) | Data Analytics Enthusiast | AWS | B.Sc. Computer Science · 3 months ago
Very nice post! Based on your post, do you suggest querying Iceberg tables whose data is hosted in S3 in Parquet format with DuckDB instead of Athena (which would save the additional cost of the Athena service)? Soumil S.