Debating the Data Stack: Why We Chose SQLMesh Over dbt (and DuckDB in the Mix)

Debating the Data Stack: Why We Chose SQLMesh Over dbt (and DuckDB in the Mix)

In 2024, Database Tycoon took on an ambitious project - build a serverless data lake using only open-source tools, fueled by CSV uploads to AWS. This stack would be open-sourced and released as a public good for all. Who would sponsor a project like this?

The client? One of the world’s leading intergovernmental organizations committed to international peace and sustainable development.

Their relatively small office would prototype the app and take ownership of the design. The goal was to let their team run it for some time, contribute their work, and prove that the model worked. Then, with a scalable process in hand, Mirian Lima, the organization's data scientist and researcher, would recruit a team of volunteers, interns, other staff, etc. to contribute. With the right documentation in hand, this could grow into an organic, (mostly) self-governing platform for sharing research knowledge.

Both the tech and the process would be difficult to build. But despite the challenge ahead, we took solace in the fact that we were working with a stakeholder who was ready for it. Mirian issued the original mandate for the project. She picked this stack to align with her goals - improved data access for the public and a standardized framework for publishing data transformations. Her team spends countless hours downloading datasets from NGOs such as The WHO and The IMF, transforming CSVs in Excel, and often re-doing that work over and over again to reproduce research. With this framework, the work only needs to be done once. If the analyst adds their Ibis code to the SQLMesh project, it will be available for future researchers as a resource.

With this mission in mind, we set out to build. We recruited engineers and Pedro Heyerdahl as data engineers and set out to pick the stack tools. Here’s where we landed:

  • SQLMesh for transformations: We chose Tobiko’s flagship product SQLMesh to run transformations primarily for its strong support for Python-style transformations. At the UN, many of the analysts are trained in ML/AI techniques and perform less in SQL and more in frameworks like Pandas and Ibis. We also chose it for its innovative state and environmental features.
  • DuckDB for the query engine: The primary data source for this project would be AWS S3. We picked DuckDB as the query engine because of its ease of support for connecting to an S3 backend. Selecting data from a parquet file with DuckDB couldn’t be simpler. Otherwise, it was top of mind when considering open-source, serverless query engines.
  • Ibis for the query language: This was a preference from Mirian’s team, who primarily work in this Python framework. Its support for both DuckDB and SQLMesh made it a natural pick for the project.

These decisions came with a long list of pros and cons. On one hand, these are cutting-edge products that are laser-focused on solving these exact data problems. The communities are tight-knit and active, quickly responding to technical issues and bug reports (Tobiko even has an AI-support channel in Slack now, and it’s pretty good!).

However, the downsides shouldn’t be ignored. These tools are for the most part, untested. At least not in the same types of high-stakes, long-term environments that other engines like Snowflake and dbt have been tested in. These tools may break, and they may stay broken. There’s also the very practical issue of bridging the skills gap, as most of our team hadn’t worked directly with these tools before. We’d be learning as we built, which always increases the project chaos indicator (PCI). Tools like the ones mentioned earlier are familiar to us - straightforward, and simple to set up. In consulting, the time to build is a critical factor in any project. It’s directly tied to scoping and budgeting. It’s usually very costly to voluntarily add uncertainty to a project, so the benefits of using it must outweigh the costs.

Shortly after we started building, the team conducted a meeting to debate the stack we chose. We had some doubts about the benefits we would be getting from these tools. For example:

  • Will we be able to utilize the benefits of SQLMesh’s state management and environments if we’re using a stateless architecture?
  • DuckDB is lacking support for Iceberg, which was identified as a stretch goal for the project. Are we delaying that indefinitely by picking DuckDB?

Stay tuned for the next article in this series, where we dig into that conversation: the topics we discussed, the debates, and the decisions made.

Deep analysis on tool identification and moving away native tools and selecting upcoming tools...good decision

Darcy Norman

Senior Analytics Engineer at Brooklyn Data Co.

1 个月

This is rad.

Arsham Eslami ???

Snowflake + Iceberg & DuckDB → Peak Performance

1 个月

Great article! Were you guys curious to try other query engines before settling on DuckDB?

Marisa Smith, PhD

Developer Relations Expert

1 个月

?? WOW!

Daniel Palma

Data Engineer | Advisor

1 个月

Very cool!

要查看或添加评论,请登录

Database Tycoon的更多文章