Fast and Cost-Effective Querying with DuckDB on AWS Lambda (Docker Container): Scaling Queries on Parquet and Table Formats (Hudi | Iceberg | Delta)
In today's fast-paced data-driven world, organizations are constantly looking for scalable and cost-effective solutions to handle large datasets. AWS Lambda, combined with DuckDB in a Docker container, offers an ideal architecture for querying massive datasets stored in formats like Parquet, Hudi, Iceberg, and Delta. This combination enables fast, on-the-fly querying of data with minimal overhead, leveraging serverless architecture for extreme scalability.
In this blog, I will guide you step-by-step on how to set up and use DuckDB within an AWS Lambda Docker container, enabling you to run efficient, in-memory SQL queries against large datasets—whether you're dealing with millions of rows or even billions. By the end, you'll be able to query your datasets without the high costs and limitations typically associated with traditional ETL pipelines or big data services.
Solution Overview
Why DuckDB on AWS Lambda?
AWS Lambda provides a serverless architecture where you only pay for the compute time you use. This model is ideal for dynamic, variable workloads that do not require a constantly running server. DuckDB, an in-process analytical database designed for fast queries, makes the combination even more powerful. Here are a few reasons why:

- Pay-per-use: you are billed only for the compute time each query actually consumes, with no idle servers.
- Fast analytics: DuckDB runs in-process with a vectorized, columnar engine, so there is no separate database server to stand up.
- Automatic scaling: concurrent invocations each get their own isolated execution environment.
- Minimal operations: no clusters to provision, patch, or maintain, unlike traditional big data services.
What You'll Need

- An AWS account with access to Lambda, ECR, and S3
- Docker installed locally for building the container image
- The AWS CLI configured with credentials
- Python 3.x and the duckdb Python package
- Sample datasets in Parquet, Hudi, Iceberg, and Delta formats (created in the next section)
Hands-on Labs
Let's set up Iceberg, Hudi, Delta, and Parquet files and upload them to S3 for the test use cases.
We will upload the generated tables to S3 with upload_data.sh, sketched below.
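A minimal sketch of what upload_data.sh might look like, assuming the sample tables were written to local folders; the bucket name is a placeholder:

```bash
#!/bin/bash
# upload_data.sh -- push locally generated sample tables to S3.
# BUCKET is a placeholder; replace it with your own bucket.
BUCKET=s3://my-duckdb-lab-bucket

aws s3 cp sample.parquet "$BUCKET/raw/sample.parquet"
aws s3 sync ./iceberg_table "$BUCKET/iceberg/"   # Iceberg metadata + data files
aws s3 sync ./hudi_table    "$BUCKET/hudi/"      # Hudi table folder (.hoodie + parquet)
aws s3 sync ./delta_table   "$BUCKET/delta/"     # Delta log + parquet files
```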
Setting Up DuckDB in AWS Lambda Docker Container
In the Dockerfile, we use a base Python image for Lambda, install dependencies from requirements.txt, and set the Lambda function's entry point with the CMD directive.
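A minimal Dockerfile along those lines, assuming the handler lives in lambda_function.py and is named lambda_handler:

```dockerfile
# AWS-provided Python base image for Lambda (bundles the runtime interface client)
FROM public.ecr.aws/lambda/python:3.11

# Install DuckDB and any other dependencies into the task root
COPY requirements.txt .
RUN pip install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"

# Copy the function code
COPY lambda_function.py ${LAMBDA_TASK_ROOT}

# Set the Lambda entry point: <module>.<handler>
CMD ["lambda_function.lambda_handler"]
```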
Lambda Function Code
Here’s an example of a Lambda function that runs DuckDB queries. The function can be triggered by an API call, executing SQL against datasets stored in S3 (e.g., Parquet, Hudi, Iceberg, or Delta format).
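Below is a minimal sketch of such a handler. The event shape (query and format fields), the region, and the secret syntax are assumptions rather than the only way to wire this up:

```python
import json

import duckdb


def lambda_handler(event, context):
    """Run a DuckDB SQL query against data stored in S3."""
    query = event["query"]                        # SQL string from the caller
    data_format = event.get("format", "parquet")  # parquet | hudi | iceberg | delta

    # In-memory database; a /tmp-backed file is an alternative if the
    # working set is larger than the function's memory.
    con = duckdb.connect(database=":memory:")

    # httpfs lets DuckDB read s3:// paths; the credential chain picks up
    # the Lambda execution role automatically (DuckDB 1.x secret syntax).
    con.execute("INSTALL httpfs; LOAD httpfs;")
    con.execute("CREATE SECRET (TYPE S3, PROVIDER CREDENTIAL_CHAIN);")

    # Iceberg and Delta need their own extensions; Parquet and Hudi
    # copy-on-write tables use the built-in Parquet reader.
    if data_format == "iceberg":
        con.execute("INSTALL iceberg; LOAD iceberg;")
    elif data_format == "delta":
        con.execute("INSTALL delta; LOAD delta;")

    rows = con.execute(query).fetchall()
    return {"statusCode": 200, "body": json.dumps(rows, default=str)}
```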
This function accepts a SQL query from the event and, depending on the data format (Parquet, Hudi, Iceberg, or Delta), loads the appropriate extension and executes the query with DuckDB.
Handling Different Data Formats (Parquet, Hudi, Iceberg, Delta)
The real power of DuckDB comes from its ability to work seamlessly with formats like Parquet, Hudi, Iceberg, and Delta. Parquet files and Hudi copy-on-write tables are handled by DuckDB's built-in Parquet reader, while Iceberg and Delta each have a dedicated DuckDB extension.
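A minimal sketch of each access path; the s3:// bucket and table names are placeholders, and httpfs provides the S3 access:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # s3:// support

# Parquet: built-in reader, with glob support for partitioned layouts.
con.sql("SELECT count(*) FROM read_parquet('s3://my-bucket/raw/*.parquet')").show()

# Iceberg: dedicated extension, pointed at the table's root folder.
con.execute("INSTALL iceberg; LOAD iceberg;")
con.sql("SELECT * FROM iceberg_scan('s3://my-bucket/iceberg/my_table') LIMIT 10").show()

# Delta: dedicated extension.
con.execute("INSTALL delta; LOAD delta;")
con.sql("SELECT * FROM delta_scan('s3://my-bucket/delta/my_table') LIMIT 10").show()

# Hudi (copy-on-write): the data files are plain Parquet, so a recursive
# glob over the table folder works; matching only *.parquet skips the
# .hoodie metadata directory.
con.sql("SELECT * FROM read_parquet('s3://my-bucket/hudi/my_table/**/*.parquet') LIMIT 10").show()
```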
Testing Locally with Docker
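Assuming the image is tagged duckdb-lambda, you can build and run it locally. The AWS Lambda base images bundle the Runtime Interface Emulator, which exposes an HTTP endpoint for test invocations:

```bash
# Build the image and run it locally on port 9000.
docker build -t duckdb-lambda .
docker run -p 9000:8080 duckdb-lambda

# In a second terminal, invoke the handler through the
# Runtime Interface Emulator's endpoint.
curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" \
  -d '{"format": "parquet", "query": "SELECT 42 AS answer"}'
```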
Test for Iceberg
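A sample invocation against the local emulator; the payload shape follows the handler sketch above, and the S3 path is a placeholder. The Parquet and Delta tests below follow the same pattern, swapping in read_parquet and delta_scan:

```bash
curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" \
  -d '{"format": "iceberg", "query": "SELECT count(*) FROM iceberg_scan('\''s3://my-bucket/iceberg/my_table'\'')"}'
```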
Output
Test for Parquet
Output
Test for Delta
Output
Unit Test File for Hudi | Iceberg | Delta | Parquet: test_duckdb_lambda.py
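A sketch of what test_duckdb_lambda.py could contain, shown here for the Parquet path only. It assumes the handler module is named lambda_function.py and that the httpfs extension can be downloaded on first run; the Hudi, Iceberg, and Delta cases would follow the same pattern with their own fixtures:

```python
import duckdb
import pytest

# Assumes the handler module from earlier is named lambda_function.py.
from lambda_function import lambda_handler


@pytest.fixture
def parquet_file(tmp_path):
    """Write a tiny Parquet file for the test to query."""
    path = str(tmp_path / "sample.parquet")
    rel = duckdb.sql("SELECT 1 AS id, 'a' AS val UNION ALL SELECT 2, 'b'")
    rel.write_parquet(path)
    return path


def test_parquet_query(parquet_file):
    event = {
        "format": "parquet",
        "query": f"SELECT count(*) AS n FROM read_parquet('{parquet_file}')",
    }
    response = lambda_handler(event, context=None)
    assert response["statusCode"] == 200
    assert "2" in response["body"]
```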
Test Results
Infra Code
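The exact infrastructure code isn't reproduced here; as one option, a minimal AWS CDK (Python) stack for a container-image Lambda could look like the following (the stack and function names, memory size, and timeout are assumptions):

```python
from aws_cdk import App, Duration, Stack
from aws_cdk import aws_lambda as _lambda
from constructs import Construct


class DuckDbLambdaStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Build the Docker image from the local Dockerfile and deploy it
        # as a Lambda function with generous memory for DuckDB.
        _lambda.DockerImageFunction(
            self, "DuckDbQueryFn",
            code=_lambda.DockerImageCode.from_image_asset("."),
            memory_size=10240,           # Lambda's current maximum
            timeout=Duration.minutes(5),
        )


app = App()
DuckDbLambdaStack(app, "DuckDbLambdaStack")
app.synth()
```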
Deploy the stack with your deployment tool of choice; for the CDK sketch above, that is `cdk deploy`.
Why This is a Great Solution
With this approach, you can leverage up to 10 GB of memory (Lambda's maximum) and efficiently query large datasets stored in S3 at minimal cost. DuckDB's fast in-memory processing means you can query datasets with billions of rows without worrying about high costs or performance degradation.
Using AWS Lambda's serverless architecture, you only pay for the actual compute time your function uses. This allows you to build scalable, cost-effective data query solutions without the overhead of provisioning and maintaining infrastructure.
Code
Conclusion
By combining DuckDB with AWS Lambda in a Docker container, you unlock an incredibly cost-effective and scalable querying solution for your data. Whether you're working with Parquet, Hudi, Iceberg, or Delta formats, this architecture offers unparalleled flexibility and performance, all while reducing the complexity and cost of traditional data processing solutions.
This hands-on approach demonstrates how easy it is to set up and scale your data queries in the cloud, helping you unlock the full potential of your data with minimal effort.
Happy querying!
Comments

Data and Cloud Engineer · 3 months ago
I think this lacks an actual cost analysis. AWS Lambda is typically 10x more expensive than EC2 for the same compute, so this solution might actually be VERY expensive for large datasets.
CEO and Founder @ Architecture & Performance | Performance Tuning, Business Intelligence · 3 months ago
You might use a DuckDB file instead of in-memory only (/tmp/dummy.duckdb instead of :memory:). This avoids problems with memory size (and can even reduce cost by allowing a smaller Lambda); DuckDB is NOT limited to memory size if it has a file as storage. If your plan is just to convert data (CSV to Parquet, or to Delta/Iceberg/Hudi) without memory-intensive functions, DuckDB will stream and use minimal memory (depending on row group size).
Lead Data/Cloud Solutions Architect | Lead Data Engineer (AWS & Azure) | MUREX MLC | Collateral | MXML | Kafka/MSK | Spark | Snowflake | Flink | Informatica | Glue | Redshift | Mongo | Dynamo DB · 3 months ago
Insightful, very interesting. I have a problem I'd like to solve using this: one parent bucket has 10 sub-folders for 10 tables, and under each table folder the data is partitioned by date. Let's assume data arrived in one date folder for one table, and I need to pick up the related data from the other tables' date folders. Is there a way to do this without traversing all folders and partitions in the other table folders?
University of Calgary | Google Summer of Code '24 | Samsung Research · 3 months ago
Regarding "DuckDB's fast in-memory processing means you can query datasets with billions of rows without worrying about high costs or performance degradation": I don't think DuckDB in Lambda can handle analytics on billions of rows stored in lakehouses. Have you tried running any tests at this scale?
Software Engineer at BMC Software | Backend Engineering (data oriented) | Data Analytics Enthusiast | AWS | B.Sc. Computer Science · 3 months ago
Very nice post! Based on your post, do you suggest querying Iceberg tables whose data is hosted in S3 in Parquet format with DuckDB instead of Athena (which would save the additional cost of the Athena service)? Soumil S.