DuckDB Meets SQLMesh: Building a Lean Data Architecture For One of the World’s Leading Intergovernmental Organizations
Database Tycoon
A full-service data engineering consultancy with an open-source-first ethos
This post is a part of a series called Behind the Build. If you missed the first post, check it out here.
In the fall of 2024, my team at Database Tycoon took on a project to build an open-source data lakehouse for a prominent global humanitarian organization. We had strict requirements: we couldn't procure software from any outside vendors beyond AWS and GitHub, which the team already used, though open-source tools were fair game. Based on these requirements, we chose DuckDB, SQLMesh, S3, and GitHub Actions to build our pipeline.
Our stakeholder Mirian is a data scientist and researcher at the organization. She specified several key requirements: open-source only, minimizing costs by avoiding long-running servers, infrastructure limited to AWS, and support for Python transformations.
After working with these tools for a few weeks, I brought my team together to discuss some important considerations. In the article below, I'll walk you through the discussions we had and the decisions we made to finish the project on time. It was a great chance to experiment with some of the most talked-about tools in the data space and see their strengths and weaknesses in a real-world scenario.
To Iceberg or not to Iceberg
Our project showcases a modern approach to data management by combining two key concepts: data lakes and data warehouses. A data lake is a centralized repository that can store vast amounts of raw, unstructured data in its native format. A data lakehouse builds upon this foundation by adding warehouse-like features—structured storage, data management, and analytics capabilities.
We used AWS S3 to create our flexible, scalable data lake. On top of this foundation, we built a lakehouse architecture that adds structure and analytical capabilities. DuckDB serves as our query engine, providing fast SQL operations directly against S3 data—particularly for columnar formats like Parquet. SQLMesh orchestrates our data transformations, letting us define pipelines in SQL and manage dependencies. This approach combines the cost-effectiveness and scalability of a data lake with the performance and structure needed for efficient analytics.
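To make the querying side concrete, here is a minimal sketch of DuckDB reading Parquet straight from S3 through its httpfs extension. The bucket, prefix, and column names are placeholders rather than the project's actual layout, and AWS credentials are assumed to be available to DuckDB (for example via the environment).

```python
import duckdb

# In-memory DuckDB instance; nothing needs to be persisted locally for ad hoc queries.
con = duckdb.connect()

# The httpfs extension lets DuckDB read s3:// paths directly.
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
con.execute("SET s3_region = 'us-east-1';")  # placeholder region

# Aggregate over Parquet files in the lake without downloading them first.
rows = con.execute("""
    SELECT event_type, COUNT(*) AS n_events
    FROM read_parquet('s3://example-lake/raw/events/*.parquet')
    GROUP BY event_type
    ORDER BY n_events DESC
""").fetchall()
print(rows)
```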
Since this discussion centered on S3, Apache Iceberg became a natural consideration. Iceberg is an open table format that adds a metadata layer on top of files in S3, enabling query engines to treat those files as structured database tables. This setup requires a metadata catalog to serve queries.
At this stage of the project, we were using Parquet as our file format because it integrates seamlessly with both DuckDB and SQLMesh. DuckDB can query Parquet files directly from S3 and export them back out through SQLMesh post-hooks.
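The export side is just as direct. Conceptually, the post-hook boils down to a DuckDB COPY statement that writes a finished model back to S3 as Parquet; this is a simplified sketch, and the database file, table, and bucket names below are illustrative.

```python
import duckdb

# Open the project database file (name is a placeholder).
con = duckdb.connect("lakehouse.duckdb")
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
# S3 region and credentials are assumed to be configured as in the read example.

# After a model has been built, write its contents out to the lake as Parquet
# so downstream consumers never need to touch DuckDB itself.
con.execute("""
    COPY analytics.orders_summary
    TO 's3://example-lake/marts/orders_summary.parquet'
    (FORMAT PARQUET)
""")
```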
Until now, this setup had been working well. However, implementing Iceberg would have required two major changes to the project: converting our storage layer from plain Parquet files to Iceberg tables, and standing up a metadata catalog to manage them.
Iceberg's main appeal was its state management. The metadata layer lets query engines "time travel" and read data as it existed at specific points in time, a feature that would help Mirian's team reproduce analyses from exact timestamps.
We encountered a significant limitation, though: DuckDB can only read the Iceberg format, not write to it. This restriction undermined one of Iceberg's key benefits for us: tracking changes to output models over time. Other query engines can already write Iceberg, but switching engines at this stage of the project wasn't feasible.
I prompted the team to reevaluate our versioning approach. The complexity of implementing Iceberg outweighed its benefits. Since we were already using Parquet files on S3, we opted for a simpler solution: we enabled native file versioning on the bucket. This let us keep a change history and gives users access to previous versions of files, albeit through a more manual process. We've shelved the Iceberg/catalog discussion until DuckDB adds write support. We're hoping this ticket gets resolved soon!
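As a sketch of what that "more manual process" looks like, here is how a previous version of an exported file can be pulled back with boto3 once bucket versioning is enabled. The bucket and key names are placeholders.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-lake"                   # placeholder bucket
KEY = "marts/orders_summary.parquet"      # placeholder object key

# Turn on native versioning for the bucket (a one-time operation).
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# List the stored versions of a single exported file.
versions = s3.list_object_versions(Bucket=BUCKET, Prefix=KEY)
for v in versions.get("Versions", []):
    print(v["VersionId"], v["LastModified"], v["IsLatest"])

# Fetch the object as it existed at an earlier point in time.
oldest = versions["Versions"][-1]["VersionId"]
old = s3.get_object(Bucket=BUCKET, Key=KEY, VersionId=oldest)
with open("orders_summary_previous.parquet", "wb") as f:
    f.write(old["Body"].read())
```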
SQLMesh Environments
SQLMesh is a modern data transformation framework designed to streamline the development, testing, and deployment of SQL-based data pipelines. It provides features like virtual data environments, automatic data versioning, and efficient incremental processing, making it a powerful alternative to traditional transformation tools. Within our architecture, SQLMesh was intended to help manage data environments dynamically, allowing for seamless experimentation and iteration without impacting production datasets.
After discussing file versioning, we tackled our next challenge: environments. We had planned to use SQLMesh virtual data environments for efficient runs, but we quickly encountered a self-imposed problem—we weren't storing any of the critical environment metadata.
The issue arose because we weren't using a cloud data warehouse (e.g., Snowflake) with a centralized metadata catalog; we were only saving downstream output files to S3. That meant we couldn't take advantage of SQLMesh's virtual data environments, which rely on this metadata to understand the state and location of each environment in the project.
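For background, SQLMesh persists this environment metadata through a state connection that is configured separately from the warehouse connection. The sketch below shows the shape of that idea using SQLMesh's Python configuration API as I understand it; the gateway name, database file, and Postgres coordinates are placeholders, not our actual setup.

```python
from sqlmesh.core.config import (
    Config,
    GatewayConfig,
    ModelDefaultsConfig,
    DuckDBConnectionConfig,
    PostgresConnectionConfig,
)

# Warehouse queries run in DuckDB, while environment/state metadata lives in a
# separate backend that survives between runs. All names and credentials are placeholders.
config = Config(
    model_defaults=ModelDefaultsConfig(dialect="duckdb"),
    default_gateway="lake",
    gateways={
        "lake": GatewayConfig(
            connection=DuckDBConnectionConfig(database="lakehouse.duckdb"),
            state_connection=PostgresConnectionConfig(
                host="state-db.example.internal",
                port=5432,
                user="sqlmesh",
                password="change-me",
                database="sqlmesh_state",
            ),
        ),
    },
)
```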
This challenge stemmed from our decision to avoid a data warehouse solution. Traditional data warehouses store metadata alongside the data itself, but we took a different approach.
With each run, our DuckDB setup required rebuilding the database from scratch—creating a new instance, loading data from S3, performing transformations, and discarding the instance. This meant we couldn't preserve any metadata between runs.
After a quick chat with the Tobiko team for guidance, we tried setting up a small PostgreSQL instance for metadata storage. However, this solution didn't last long for two key reasons:
Our final solution was almost crude in its simplicity: store the entire DuckDB file on S3. Since our project was small in scale, S3 costs stayed minimal. Though this approach might seem basic, it has proven to be an effective alternative to a remote warehouse server. The pipeline simply loads the database file, performs its operations, and writes it back to S3, a practical solution that fits neatly within our system's constraints.
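In practice the pattern is a thin wrapper around the SQLMesh CLI: download the database file, run the pipeline, upload the result. Here is a simplified sketch; the bucket, key, and file names are placeholders, and in our case the wrapper runs inside GitHub Actions.

```python
import subprocess
import boto3

BUCKET = "example-lake"               # placeholder bucket
STATE_KEY = "state/lakehouse.duckdb"  # placeholder object key
LOCAL_DB = "lakehouse.duckdb"

s3 = boto3.client("s3")

# 1. Pull the persisted DuckDB file (data + SQLMesh state) from S3.
s3.download_file(BUCKET, STATE_KEY, LOCAL_DB)

# 2. Run the SQLMesh pipeline against the local file.
subprocess.run(["sqlmesh", "run"], check=True)

# 3. Push the updated file back so the next run starts from this state.
s3.upload_file(LOCAL_DB, BUCKET, STATE_KEY)
```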
Back to the Infrastructure
Our exploration revealed two critical challenges: DuckDB cannot yet write the Iceberg format, which ruled out catalog-backed versioning and time travel, and without a long-running warehouse there was no built-in place to persist the metadata that SQLMesh's virtual data environments depend on.
Given these constraints, we refocused on the project's core requirements: open-source tools only, no long-running servers to keep costs down, infrastructure limited to AWS, and support for Python transformations.
Though we had to set aside some promising features due to our toolset's limitations, we prioritized practical solutions. The team is proud of our work, and we're confident that future SQLMesh and DuckDB updates will make this architecture even stronger.
Switching Tools?
At this point in the project, we couldn't switch tools—we were too far along. However, given what I know now, here's what I might have done differently from the start:
Final Thoughts
Working with cutting-edge tools like SQLMesh and DuckDB has revealed both their impressive capabilities and current limitations. We're excited to see if future updates will address these challenges, making this architecture even more powerful.
Stay tuned for my next post where I'll share our final results! After three months of careful planning and development, we've created a foundation that will power the organization-wide data-gathering initiative.