DuckDB Meets SQLMesh: Building a Lean Data Architecture For One of the World’s Leading Intergovernmental Organizations
Database Tycoon
A full-service data engineering consultancy with an open-source-first ethos
This post is a part of a series called Behind the Build. If you missed the first post, check it out here.
In the fall of 2024, my team at Database Tycoon took on a project to build an open-source data lakehouse for a prominent global humanitarian organization. We had strict requirements: we couldn't procure software from any outside vendors beyond AWS and GitHub, which the team already used, though open-source tools were fair game. Based on these requirements, we chose DuckDB, SQLMesh, S3, and GitHub Actions to build our pipeline.
Our stakeholder Mirian is a data scientist and researcher at the organization. She specified several key requirements: open-source only, minimizing costs by avoiding long-running servers, infrastructure limited to AWS, and support for Python transformations.
After working with these tools for a few weeks, I brought my team together to discuss some important considerations. In the article below, I'll walk you through the discussions we had and the decisions we made to finish the project on time. It was a great chance to experiment with some of the most talked-about tools in the data space and see their strengths and weaknesses in a real-world scenario.
To Iceberg or not to Iceberg
Our project showcases a modern approach to data management by combining two key concepts: data lakes and data warehouses. A data lake is a centralized repository that can store vast amounts of raw, unstructured data in its native format. A data lakehouse builds upon this foundation by adding warehouse-like features—structured storage, data management, and analytics capabilities.
We used AWS S3 to create our flexible, scalable data lake. On top of this foundation, we built a lakehouse architecture that adds structure and analytical capabilities. DuckDB serves as our query engine, providing fast SQL operations directly against S3 data—particularly for columnar formats like Parquet. SQLMesh orchestrates our data transformations, letting us define pipelines in SQL and manage dependencies. This approach combines the cost-effectiveness and scalability of a data lake with the performance and structure needed for efficient analytics.
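To make the querying side concrete, here is a minimal sketch of DuckDB reading Parquet straight from S3 through its httpfs extension. The bucket, prefix, and column names are placeholders rather than the project's actual layout, and AWS credentials are assumed to be available to DuckDB (for example via the environment).

```python
import duckdb

# In-memory DuckDB instance; nothing needs to be persisted locally for ad hoc queries.
con = duckdb.connect()

# The httpfs extension lets DuckDB read s3:// paths directly.
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
con.execute("SET s3_region = 'us-east-1';")  # placeholder region

# Aggregate over Parquet files in the lake without downloading them first.
rows = con.execute("""
    SELECT event_type, COUNT(*) AS n_events
    FROM read_parquet('s3://example-lake/raw/events/*.parquet')
    GROUP BY event_type
    ORDER BY n_events DESC
""").fetchall()
print(rows)
```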
Since this discussion centered on S3, Apache Iceberg became a natural consideration. Iceberg is an open table format that adds a metadata layer on top of files in S3, enabling query engines to treat those files as structured database tables. This setup requires a metadata catalog to serve queries.
At this stage of the project, we were using Parquet as our file format because it integrates seamlessly with both DuckDB and SQLMesh. DuckDB can query Parquet files directly from S3 and export them back out through SQLMesh post-hooks.
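The export side is just as direct. Conceptually, the post-hook boils down to a DuckDB COPY statement that writes a finished model back to S3 as Parquet; this is a simplified sketch, and the database file, table, and bucket names below are illustrative.

```python
import duckdb

# Open the project database file (name is a placeholder).
con = duckdb.connect("lakehouse.duckdb")
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
# S3 region and credentials are assumed to be configured as in the read example.

# After a model has been built, write its contents out to the lake as Parquet
# so downstream consumers never need to touch DuckDB itself.
con.execute("""
    COPY analytics.orders_summary
    TO 's3://example-lake/marts/orders_summary.parquet'
    (FORMAT PARQUET)
""")
```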
Until now, this setup had been working well. However, implementing Iceberg would have required two major changes to the project: converting our storage layer from plain Parquet files to Iceberg tables, and standing up a metadata catalog to manage them.
Iceberg's main appeal was its state management. The metadata layer lets query engines "time travel" and read data as it existed at specific points in time, a feature that would help Mirian's team reproduce analyses from exact timestamps.
We encountered a significant limitation, though: DuckDB can only read the Iceberg format, not write to it. This restriction undermined one of Iceberg's key benefits for us: tracking changes to output models over time. Other query engines can already write Iceberg, but switching engines at this stage of the project wasn't feasible.
I prompted the team to reevaluate our versioning approach. The complexity of implementing Iceberg outweighed its benefits. Since we were already using Parquet files on S3, we opted for a simpler solution: we enabled native file versioning on the bucket. This let us keep a change history and gives users access to previous versions of files, albeit through a more manual process. We've shelved the Iceberg/catalog discussion until DuckDB adds write support. We're hoping this ticket gets resolved soon!
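As a sketch of what that "more manual process" looks like, here is how a previous version of an exported file can be pulled back with boto3 once bucket versioning is enabled. The bucket and key names are placeholders.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-lake"                   # placeholder bucket
KEY = "marts/orders_summary.parquet"      # placeholder object key

# Turn on native versioning for the bucket (a one-time operation).
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# List the stored versions of a single exported file.
versions = s3.list_object_versions(Bucket=BUCKET, Prefix=KEY)
for v in versions.get("Versions", []):
    print(v["VersionId"], v["LastModified"], v["IsLatest"])

# Fetch the object as it existed at an earlier point in time.
oldest = versions["Versions"][-1]["VersionId"]
old = s3.get_object(Bucket=BUCKET, Key=KEY, VersionId=oldest)
with open("orders_summary_previous.parquet", "wb") as f:
    f.write(old["Body"].read())
```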
SQLMesh Environments
SQLMesh is a modern data transformation framework designed to streamline the development, testing, and deployment of SQL-based data pipelines. It provides features like virtual data environments, automatic data versioning, and efficient incremental processing, making it a powerful alternative to traditional transformation tools. Within our architecture, SQLMesh was intended to help manage data environments dynamically, allowing for seamless experimentation and iteration without impacting production datasets.
After discussing file versioning, we tackled our next challenge: environments. We had planned to use SQLMesh virtual data environments for efficient runs, but we quickly encountered a self-imposed problem—we weren't storing any of the critical environment metadata.
The issue arose because we weren't using a cloud data warehouse (e.g., Snowflake) with a centralized metadata catalog; we were only saving downstream output files to S3. That meant we couldn't take advantage of SQLMesh's virtual data environments, which rely on this metadata to understand the state and location of each environment in the project.
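For background, SQLMesh persists this environment metadata through a state connection that is configured separately from the warehouse connection. The sketch below shows the shape of that idea using SQLMesh's Python configuration API as I understand it; the gateway name, database file, and Postgres coordinates are placeholders, not our actual setup.

```python
from sqlmesh.core.config import (
    Config,
    GatewayConfig,
    ModelDefaultsConfig,
    DuckDBConnectionConfig,
    PostgresConnectionConfig,
)

# Warehouse queries run in DuckDB, while environment/state metadata lives in a
# separate backend that survives between runs. All names and credentials are placeholders.
config = Config(
    model_defaults=ModelDefaultsConfig(dialect="duckdb"),
    default_gateway="lake",
    gateways={
        "lake": GatewayConfig(
            connection=DuckDBConnectionConfig(database="lakehouse.duckdb"),
            state_connection=PostgresConnectionConfig(
                host="state-db.example.internal",
                port=5432,
                user="sqlmesh",
                password="change-me",
                database="sqlmesh_state",
            ),
        ),
    },
)
```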
This challenge stemmed from our decision to avoid a data warehouse solution. Traditional data warehouses store metadata alongside the data itself, but we took a different approach.
With each run, our DuckDB setup required rebuilding the database from scratch—creating a new instance, loading data from S3, performing transformations, and discarding the instance. This meant we couldn't preserve any metadata between runs.
After a quick chat with the Tobiko team for guidance, we tried setting up a small PostgreSQL instance for metadata storage. However, this solution didn't last long for two key reasons:
Our final solution was almost crude in its simplicity: store the entire DuckDB file on S3. Since our project was small in scale, S3 costs stayed minimal. Though this approach might seem basic, it has proven to be an effective alternative to a remote warehouse server. The pipeline simply loads the database file, performs its operations, and writes it back to S3, a practical solution that fits neatly within our system's constraints.
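In practice the pattern is a thin wrapper around the SQLMesh CLI: download the database file, run the pipeline, upload the result. Here is a simplified sketch; the bucket, key, and file names are placeholders, and in our case the wrapper runs inside GitHub Actions.

```python
import subprocess
import boto3

BUCKET = "example-lake"               # placeholder bucket
STATE_KEY = "state/lakehouse.duckdb"  # placeholder object key
LOCAL_DB = "lakehouse.duckdb"

s3 = boto3.client("s3")

# 1. Pull the persisted DuckDB file (data + SQLMesh state) from S3.
s3.download_file(BUCKET, STATE_KEY, LOCAL_DB)

# 2. Run the SQLMesh pipeline against the local file.
subprocess.run(["sqlmesh", "run"], check=True)

# 3. Push the updated file back so the next run starts from this state.
s3.upload_file(LOCAL_DB, BUCKET, STATE_KEY)
```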
Back to the Infrastructure
Our exploration revealed two critical challenges: DuckDB cannot yet write the Iceberg format, which ruled out catalog-backed versioning and time travel, and without a long-running warehouse there was no built-in place to persist the metadata that SQLMesh's virtual data environments depend on.
Given these constraints, we refocused on the project's core requirements: open-source tools only, no long-running servers to keep costs down, infrastructure limited to AWS, and support for Python transformations.
Though we had to set aside some promising features due to our toolset's limitations, we prioritized practical solutions. The team is proud of our work, and we're confident that future SQLMesh and DuckDB updates will make this architecture even stronger.
Switching Tools?
At this point in the project, we couldn't switch tools—we were too far along. However, given what I know now, here's what I might have done differently from the start:
Final Thoughts
Working with cutting-edge tools like SQLMesh and DuckDB has revealed both their impressive capabilities and current limitations. We're excited to see if future updates will address these challenges, making this architecture even more powerful.
Stay tuned for my next post where I'll share our final results! After three months of careful planning and development, we've created a foundation that will power the organization-wide data-gathering initiative.