Debating the Data Stack: Why We Chose SQLMesh Over dbt (and DuckDB in the Mix)
Database Tycoon
A full-service data engineering consultancy with an open-source-first ethos
In 2024, Database Tycoon took on an ambitious project: build a serverless data lake using only open-source tools, fueled by CSV uploads to AWS. The stack would be open-sourced and released as a public good for all. Who would sponsor a project like this?
The client? One of the world’s leading intergovernmental organizations committed to international peace and sustainable development.
Their relatively small office would prototype the app and take ownership of the design. The goal was to let their team run it for a while, contribute their work, and prove that the model worked. Then, with a scalable process in hand, Mirian Lima, the organization's data scientist and researcher, would recruit volunteers, interns, and other staff to contribute. With the right documentation, this could grow into an organic, (mostly) self-governing platform for sharing research knowledge.
Both the tech and the process would be difficult to build. But despite the challenge ahead, we took solace in the fact that we were working with a stakeholder who was ready for it. Mirian issued the original mandate for the project. She picked this stack to align with her goals: improved data access for the public and a standardized framework for publishing data transformations. Her team spends countless hours downloading datasets from organizations such as the WHO and the IMF, transforming CSVs in Excel, and often redoing that work over and over to reproduce research. With this framework, the work only needs to be done once. If an analyst adds their Ibis code to the SQLMesh project, it becomes a resource for future researchers.
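To make that workflow concrete, here's a minimal sketch of what an analyst's contribution might look like: a SQLMesh Python model that uses Ibis on DuckDB to clean an uploaded CSV. The model name, columns, file path, and cleaning logic are hypothetical illustrations, not taken from the actual project.

```python
import typing as t
from datetime import datetime

import ibis
import pandas as pd
from sqlmesh import ExecutionContext, model


# Hypothetical staging model: the name, columns, and file path are
# illustrative, not the organization's real assets.
@model(
    "staging.who_indicators",
    columns={
        "country_code": "text",
        "indicator": "text",
        "year": "int",
        "value": "double",
    },
)
def execute(
    context: ExecutionContext,
    start: datetime,
    end: datetime,
    execution_time: datetime,
    **kwargs: t.Any,
) -> pd.DataFrame:
    # Read the uploaded CSV through an in-memory DuckDB connection via Ibis.
    con = ibis.duckdb.connect()
    raw = con.read_csv("data/uploads/who_indicators.csv")

    # Express the cleanup once in Ibis; future researchers rerun it for free.
    cleaned = raw.filter(raw["value"].notnull()).select(
        "country_code", "indicator", "year", "value"
    )
    return cleaned.to_pandas()
```

Once this file lives in the SQLMesh project, the transformation is versioned and reproducible, which is exactly the "do the work once" property Mirian was after.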
With this mission in mind, we set out to build. We brought on data engineers, including Pedro Heyerdahl, and picked the tools for the stack. Here's where we landed (a configuration sketch showing how the pieces connect follows the list):

- SQLMesh (instead of dbt) for managing data transformations
- DuckDB as the query engine
- Ibis for expressing transformations in Python
- AWS for storing the uploaded CSVs
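For a sense of how lightweight this stack is to stand up, here is a minimal sketch of a DuckDB-backed project using SQLMesh's Python config API. The gateway name and database filename are placeholders, not the project's actual settings.

```python
# config.py: a minimal sketch of a DuckDB-backed SQLMesh project.
# The database filename is a placeholder, not the project's real path.
from sqlmesh.core.config import (
    Config,
    DuckDBConnectionConfig,
    GatewayConfig,
    ModelDefaultsConfig,
)

config = Config(
    gateways={
        "duckdb": GatewayConfig(
            connection=DuckDBConnectionConfig(database="lake.duckdb")
        )
    },
    default_gateway="duckdb",
    model_defaults=ModelDefaultsConfig(dialect="duckdb"),
)
```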
These decisions came with a long list of pros and cons. On one hand, these are cutting-edge products that are laser-focused on solving these exact data problems. The communities are tight-knit and active, quickly responding to technical issues and bug reports (Tobiko even has an AI-support channel in Slack now, and it’s pretty good!).
However, the downsides shouldn't be ignored. These tools are, for the most part, untested in the kinds of high-stakes, long-term environments where engines like Snowflake and dbt have proven themselves. These tools may break, and they may stay broken. There's also the very practical issue of bridging the skills gap, as most of our team hadn't worked directly with these tools before. We'd be learning as we built, which always raises the project chaos indicator (PCI). Tools like Snowflake and dbt are familiar to us: straightforward and simple to set up. In consulting, the time to build is a critical factor in any project; it's directly tied to scoping and budgeting. Voluntarily adding uncertainty to a project is usually very costly, so the benefits of these tools had to outweigh that cost.
Shortly after we started building, the team held a meeting to debate the stack we had chosen. We had real doubts about whether the benefits of these tools justified the added risk.
Stay tuned for the next article in this series, where we dig into that conversation: the topics we discussed, the debates, and the decisions made.