Fast and Cost-Effective Querying with DuckDB on AWS Lambda (Docker Container): Scaling Queries on Parquet and Table Formats (Hudi | Iceberg | Delta)

In today's fast-paced data-driven world, organizations are constantly looking for scalable and cost-effective solutions to handle large datasets. AWS Lambda, combined with DuckDB in a Docker container, offers an ideal architecture for querying massive datasets stored in formats like Parquet, Hudi, Iceberg, and Delta. This combination enables fast, on-the-fly querying of data with minimal overhead, leveraging serverless architecture for extreme scalability.

In this blog, I will guide you step-by-step on how to set up and use DuckDB within an AWS Lambda Docker container, enabling you to run efficient, in-memory SQL queries against large datasets—whether you're dealing with millions of rows or even billions. By the end, you'll be able to query your datasets without the high costs and limitations typically associated with traditional ETL pipelines or big data services.

Solution Overview


Why DuckDB on AWS Lambda?

AWS Lambda provides a serverless architecture where you only pay for the compute time you use. This model is ideal for dynamic, variable workloads that do not require a constant running server. DuckDB, an in-memory database designed for fast, analytical queries, makes this combination even more powerful. Here are a few reasons why:

  1. Cost-Effective: With Lambda, you're charged based on execution time. DuckDB can perform fast in-memory queries, reducing the time your Lambda functions need to run, meaning you'll only pay for minimal compute resources.
  2. Scalability: AWS Lambda scales automatically to handle large volumes of concurrent requests, and each invocation runs its own isolated DuckDB instance, so thousands of queries can execute concurrently without contending with one another.
  3. Ease of Use: DuckDB allows you to query Parquet, Hudi, Iceberg, and Delta formats natively with minimal configuration, reducing the need for complex data pipelines or additional services.


What You'll Need

  • AWS Account: To deploy Lambda functions.
  • Docker: To create the Lambda container.
  • DuckDB: The powerful database engine for fast querying.
  • Data in Parquet, Hudi, Iceberg, or Delta: Stored in an accessible location like S3.
  • IAM Permissions: To access your S3 data and execute Lambda functions.

Hands-On Labs

Let's set up and upload Iceberg, Hudi, Delta, and Parquet files to S3 for our test use cases.

We will upload the sample data to S3 with upload_data.sh:
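A minimal sketch of what upload_data.sh might contain; the bucket name and local folder layout are placeholders, so adjust them to your environment:

```bash
#!/bin/bash
set -euo pipefail

# Placeholder bucket; replace with your own.
BUCKET="s3://my-bucket"

# Copy each sample dataset to its own prefix on S3.
aws s3 cp ./data/parquet "$BUCKET/parquet/" --recursive
aws s3 cp ./data/hudi    "$BUCKET/hudi/"    --recursive
aws s3 cp ./data/iceberg "$BUCKET/iceberg/" --recursive
aws s3 cp ./data/delta   "$BUCKET/delta/"   --recursive
```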


Setting Up DuckDB in an AWS Lambda Docker Container

In the Dockerfile, we're using a base Python image for Lambda, copying dependencies from a requirements.txt, and setting the Lambda function's entry point with the CMD directive.
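A minimal sketch of such a Dockerfile is shown below; the Python version and handler module name (lambda_function.handler) are assumptions, so match them to your own project:

```dockerfile
# AWS-provided Python base image for Lambda
FROM public.ecr.aws/lambda/python:3.11

# Install DuckDB and the other dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy the function code into the Lambda task root
COPY lambda_function.py ${LAMBDA_TASK_ROOT}

# Entry point: module.function of the handler (assumed name)
CMD ["lambda_function.handler"]
```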

Lambda Function Code

Here’s an example of a Lambda function that runs DuckDB queries. The function can be triggered by an API call, executing SQL against datasets stored in S3 (e.g., Parquet, Hudi, Iceberg, or Delta format).
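The original post shows the function as a screenshot; the snippet below is a minimal reconstruction of the same idea, assuming the event carries the SQL text in a `query` field (the field name and the extension setup are my assumptions, not the post's exact code):

```python
import json

import duckdb


def handler(event, context):
    # Assumption: the incoming event carries the SQL text in a "query" field.
    query = event["query"]

    con = duckdb.connect(database=":memory:")
    # httpfs enables s3:// reads; iceberg/delta add table-format support;
    # aws provides the credential-chain provider used below.
    for ext in ("httpfs", "aws", "iceberg", "delta"):
        con.execute(f"INSTALL {ext}")
        con.execute(f"LOAD {ext}")
    # Pick up the Lambda execution role's credentials for S3 access.
    con.execute("CREATE SECRET (TYPE S3, PROVIDER CREDENTIAL_CHAIN)")

    rows = con.execute(query).fetchall()
    columns = [col[0] for col in con.description]

    return {
        "statusCode": 200,
        "body": json.dumps([dict(zip(columns, row)) for row in rows], default=str),
    }
```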



This function accepts SQL queries from the event and, depending on the target format (Parquet, Hudi, Iceberg, or Delta), loads the appropriate data and executes the query using DuckDB.

Handling Different Data Formats (Parquet, Hudi, Iceberg, Delta)

The real power of DuckDB comes from its ability to work seamlessly with formats like Parquet, Hudi, Iceberg, and Delta (see the sketch after this list):

  • Parquet: DuckDB can query Parquet files directly with the read_parquet function.
  • Hudi: DuckDB has no dedicated Hudi extension, but Hudi copy-on-write tables store their data as plain Parquet files, so they can be queried with read_parquet as well.
  • Iceberg: The iceberg extension lets you query Iceberg tables (via iceberg_scan) and inspect their metadata.
  • Delta: The delta extension supports Delta Lake, enabling efficient querying of Delta tables via delta_scan.
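As a rough illustration, here is how those reads might look; the S3 paths are placeholders and the extension setup assumes a recent DuckDB release:

```python
import duckdb

con = duckdb.connect()
for ext in ("httpfs", "aws", "iceberg", "delta"):
    con.execute(f"INSTALL {ext}")
    con.execute(f"LOAD {ext}")
# Use the AWS credential chain for s3:// access.
con.execute("CREATE SECRET (TYPE S3, PROVIDER CREDENTIAL_CHAIN)")

# Parquet files, queried directly.
print(con.execute(
    "SELECT COUNT(*) FROM read_parquet('s3://my-bucket/parquet/*.parquet')").fetchall())

# Hudi copy-on-write: the data files are plain Parquet.
print(con.execute(
    "SELECT COUNT(*) FROM read_parquet('s3://my-bucket/hudi/my_table/**/*.parquet')").fetchall())

# Iceberg table, scanned by its root path.
print(con.execute(
    "SELECT * FROM iceberg_scan('s3://my-bucket/iceberg/my_table') LIMIT 5").fetchall())

# Delta Lake table.
print(con.execute(
    "SELECT * FROM delta_scan('s3://my-bucket/delta/my_table') LIMIT 5").fetchall())
```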

Testing Locally with Docker
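Assuming the image is tagged duckdb-lambda (a placeholder name), building and running it locally might look like this; the AWS base images bundle the Lambda Runtime Interface Emulator, which listens on port 8080 inside the container:

```bash
docker build -t duckdb-lambda .
docker run --rm -p 9000:8080 \
  -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_REGION \
  duckdb-lambda
```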

Test for Iceberg
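With the container running, an Iceberg query can be posted to the emulator endpoint; the table path is a placeholder:

```python
import requests

# Local Lambda Runtime Interface Emulator endpoint (mapped to port 9000 above).
URL = "http://localhost:9000/2015-03-31/functions/function/invocations"

payload = {"query": "SELECT * FROM iceberg_scan('s3://my-bucket/iceberg/my_table') LIMIT 5"}
print(requests.post(URL, json=payload).json())
```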

Output

Test for Parquet
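The Parquet test follows the same pattern, just with a different payload (the path is again a placeholder):

```python
import requests

URL = "http://localhost:9000/2015-03-31/functions/function/invocations"
payload = {"query": "SELECT COUNT(*) FROM read_parquet('s3://my-bucket/parquet/*.parquet')"}
print(requests.post(URL, json=payload).json())
```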

Output

Test for Delta
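And likewise for Delta:

```python
import requests

URL = "http://localhost:9000/2015-03-31/functions/function/invocations"
payload = {"query": "SELECT * FROM delta_scan('s3://my-bucket/delta/my_table') LIMIT 5"}
print(requests.post(URL, json=payload).json())
```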

Output

Unit Test File for Hudi | Iceberg | Delta | Parquet: test_duckdb_lambda.py
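The full test file lives in the repository; a trimmed sketch of its shape might look like this, assuming the handler reconstruction above and placeholder S3 paths:

```python
import pytest

from lambda_function import handler  # assumed module name

# One representative query per format; the paths are placeholders.
QUERIES = {
    "parquet": "SELECT COUNT(*) FROM read_parquet('s3://my-bucket/parquet/*.parquet')",
    "hudi": "SELECT COUNT(*) FROM read_parquet('s3://my-bucket/hudi/my_table/**/*.parquet')",
    "iceberg": "SELECT COUNT(*) FROM iceberg_scan('s3://my-bucket/iceberg/my_table')",
    "delta": "SELECT COUNT(*) FROM delta_scan('s3://my-bucket/delta/my_table')",
}


@pytest.mark.parametrize("fmt", sorted(QUERIES))
def test_query_succeeds(fmt):
    # Each format should execute end to end and return HTTP 200.
    response = handler({"query": QUERIES[fmt]}, None)
    assert response["statusCode"] == 200
```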

Test Results

Infra Code

Deploy the stack:
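The exact command depends on how the stack in the repository is defined; if it were an AWS SAM template (an assumption on my part, so check the repo's infra code), deployment would look like:

```bash
sam build
sam deploy --guided   # prompts for stack name, region, and ECR repo on first run
```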

Why This is a Great Solution

With this approach, you can leverage up to 10 GB of memory in a Lambda function and efficiently query large datasets stored in S3, all at minimal cost. DuckDB's fast in-memory processing means even tables with billions of rows can be scanned and aggregated quickly, provided the queries fit within Lambda's memory and timeout limits.

Using AWS Lambda's serverless architecture, you only pay for the actual compute time your function uses. This allows you to build scalable, cost-effective data query solutions without the overhead of provisioning and maintaining infrastructure.

Code

https://github.com/soumilshah1995/duckdb-on-lambda/blob/main/README.md


Conclusion

By combining DuckDB with AWS Lambda in a Docker container, you unlock an incredibly cost-effective and scalable querying solution for your data. Whether you're working with Parquet, Hudi, Iceberg, or Delta formats, this architecture offers unparalleled flexibility and performance, all while reducing the complexity and cost of traditional data processing solutions.

This hands-on approach demonstrates how easy it is to set up and scale your data queries in the cloud, helping you unlock the full potential of your data with minimal effort.

Happy querying!

References

https://juhache.substack.com/p/exploring-duckdb-aws-lambda

https://medium.com/@kkyon/profile-a-data-lake-built-with-aws-lambda-and-duckdb-2fc810ff9f4d

https://tobilg.com/using-duckdb-in-aws-lambda


Rémi Dettai

Data and Cloud Engineer

3 months ago

I think this lacks an actual cost analysis. AWS Lambda is typically 10x more expensive than EC2 for the same compute, so this solution might actually be VERY expensive for large datasets.

Romain Ferraton

CEO and Founder @ Architecture & Performance | Performance Tuning, Business Intelligence

3 months ago

You might use a DuckDB file instead of in-memory only (/tmp/dummy.duckdb instead of :memory:). This will avoid your problem with memory size (and can reduce cost by allowing a smaller Lambda), since DuckDB is NOT limited to memory size when it has a file as storage. If your plan is just to convert (CSV to Parquet or Delta/Iceberg/Hudi) without memory-intensive functions, DuckDB will stream and use minimal memory (depending on row-group size).

Kishore Panda

Lead Data/Cloud Solutions Architect | Lead Data Engineer (AWS & Azure) | MUREX MLC | Collateral | MXML | Kafka/MSK | Spark | Snowflake | Flink | Informatica | Glue | Redshift | Mongo | Dynamo DB

3 months ago

Insightful, very interesting. I have a problem I'd like to solve using this: one parent bucket has 10 sub-prefixes for 10 table folders, and under each sub-prefix there are date-partitioned folders. Let's assume data arrived in one date folder for one table, and I need to pick up the related data in the other tables' dated folders. Is there a way to do this without traversing all folders and partitions in the other table folders?

Humaid Kidwai

University of Calgary | Google Summer of Code '24 | Samsung Research

3 months ago

"DuckDB's fast in-memory processing means you can query datasets with billions of rows without worrying about high costs or performance degradation" I don't think DuckDB in Lambda can handle analytics on billions of rows stored in Lakehouses. Have you tried running any tests at this scale?

Yaron Sofer

Software Engineer at BMC Software | backend engineering (data oriented) | data analytics enthusiast | AWS | B.Sc. Computer Science

3 months ago

Very nice post! Based on your post, do you suggest querying Iceberg tables whose data is hosted on S3 in Parquet format using DuckDB instead of Athena (which would avoid the additional cost of the Athena service)? Soumil S.
