Unleash the power of SQLMesh

In the current data engineering space we have many trending tools (DuckDB, dbt, Airbyte, Polars, DataFusion, etc.), mainly because they are easy to implement, open source, and community supported. dbt is one such tool: it helps with data transformation using templating, though it is not the first tool to do this; Informatica did something similar for transformations back in the day. But dbt is a complete package, covering a CLI, tests, a DRY approach, and lineage.

Like every other tool, dbt also comes with some shortcomings:

  • Too much Jinja templating hides the original SQL
  • Managing state and dependencies is not possible in dbt

I see a trend in the data engineering ecosystem: you build something open source and then build a paid version on top of it. You also integrate well with your brothers and sisters (the other tools) in the data engineering space.

I will be referring to SQLMesh as SM for the rest of this article.

Understanding how SM works

Create models - These work similarly to dbt models; the logic can be written in SQL or Python.
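
To make this concrete, here is a minimal sketch of what a SQLMesh SQL model file looks like; the MODEL block carries the metadata that dbt would normally keep in YAML/config. The model name, columns, and source table below are made up for illustration, not taken from this article.

```sql
-- Minimal SQLMesh SQL model sketch (hypothetical names).
MODEL (
  name staging.stg_orders,   -- hypothetical model name
  kind FULL,                 -- rebuild the whole table on each run
  cron '@daily'
);

SELECT
  id AS order_id,
  user_id AS customer_id,
  order_date,
  status
FROM raw.orders;             -- hypothetical source table
```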

Plan - This is where it starts getting interesting. When we intend to make a change to a model, it can have two effects:

  • the logical implications of the change
  • the computational requirements of the change

The plan command identifies all the direct and indirect changes that a code change causes, and it can be run per environment. In some projects we see a developer making code changes at lightning speed but forgetting to check the computational impact; the code is deployed to prod and the developer then realizes the runtime has increased. SM addresses this.

Apply - The apply step applies the changes shown in the plan. The computation performed is based on the change we introduced.

For example:

If the change is adding a column to a model and that column is not used anywhere else, then only that model is affected, and the backfill is done only for that model.

If the change is a breaking one, such as changing the calculation logic of a column that is referenced in multiple models, then all the impacted models have to be backfilled.

This is super clean, right? In the Terraform world, once a plan is generated it is taken to a meeting for discussion and gets everyone's approval, so everyone is on the same page. SM is similar, but for the data team.
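
As a rough sketch of what that flow looks like from the CLI (the environment name here is just an example):

```bash
# Compute a plan against the dev environment. SQLMesh categorises each affected
# model as breaking or non-breaking, lists the backfills it would run, and
# prompts for confirmation before applying anything.
sqlmesh plan dev
```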

Virtual environments - Data engineers always complain about data not being similar to the production environment and want a copy of production data to test effectively. SM addresses this problem with virtual data environments. What this means is that we don't copy the data to a non-prod environment; instead, a virtual layer pointing to the prod data is created. So the data you will be using to test is actually the prod data. How cool is that?

SM also has a way to define tests and audits. There was also a mention of using DuckDB for local testing and Snowflake for the prod environment. This is possible because SM is built on SQLGlot, a SQL transpiler, so dialect-to-dialect SQL conversion is possible. Here is the link for SQLGlot: https://github.com/tobymao/sqlglot
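
A tiny SQLGlot example of that dialect-to-dialect conversion (the query is only an illustration):

```python
import sqlglot

# Transpile a DuckDB-flavoured expression into Snowflake SQL.
duckdb_sql = "SELECT EPOCH_MS(1618088028295)"
snowflake_sql = sqlglot.transpile(duckdb_sql, read="duckdb", write="snowflake")[0]
print(snowflake_sql)
```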

Some of the features that I think are very good:

  • State management
  • Column-level and table-level lineage without having to use ref functions. The lineage graph also shows the corresponding CTE responsible for a column's transformation
  • Production-like environments for dev, which use views on top of prod data
  • Comments are converted into table and column docs, which is cool; no need to write painful YAML files
  • Changes are segregated into breaking and non-breaking, with a proposal to backfill. We can also easily understand what kind of computational change we are bringing into the system
  • Unit tests can be done at the CTE level (see the sketch after this list)
  • There is an adapter that works on existing dbt projects, but some modifications need to be made to Jinja code (macros)
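
Here is roughly what a CTE-level unit test looks like in SM's YAML test format, as I understand it from the docs; the model, CTE, and column names below are made up, so treat this as a sketch rather than a copy-paste recipe:

```yaml
test_stg_orders_filter:
  model: staging.stg_orders          # hypothetical model under test
  inputs:
    raw.orders:                      # hypothetical upstream model
      rows:
        - id: 1
          status: completed
        - id: 2
          status: returned
  outputs:
    ctes:
      filtered_orders:               # assert on an intermediate CTE, not just the final query
        rows:
          - id: 1
            status: completed
    query:
      rows:
        - order_id: 1
          status: completed
```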

Other fancy features

  • SM has a CLI, a notebook interface, Python APIs, and also a good-looking UI (a small Python API sketch follows this list)
  • SM maintains its own metadata in the warehouse
  • Airflow integration
  • CI/CD integration
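
A tiny sketch of the Python API as I recall it from the docs; the project path and environment name are placeholders, so double-check the exact signatures:

```python
from sqlmesh import Context

# Load a SQLMesh (or adapted dbt) project from disk.
context = Context(paths="path/to/project")

# Build a plan against the dev environment, the programmatic
# equivalent of `sqlmesh plan dev`.
plan = context.plan("dev")
```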

I will be trying out SM's dbt adapter. Let's see what the compatibility looks like.

Step 1 - Clone the repo below and follow the instructions in the GitHub README.

https://github.com/dbt-labs/jaffle_shop_duckdb
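
The commands for this step are just the usual clone; everything after that follows the repo's README:

```bash
git clone https://github.com/dbt-labs/jaffle_shop_duckdb.git
cd jaffle_shop_duckdb
```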

Step 2 - Install SM with the UI option. SM's install instructions are clear.
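
Something along these lines, assuming the UI ships as the web extra; check the install docs if it fails:

```bash
pip install "sqlmesh[web]"
```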

Step 3 - For the dbt project to work with SM, we have to add a start date in dbt_project.yml. This will be used for the backfills.
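
My recollection of the adapter docs is that the start date goes under the top-level models block of dbt_project.yml, roughly like this (the key name and date are illustrative, so verify against the current docs):

```yaml
models:
  start: '2018-01-01'   # earliest date SQLMesh should consider when backfilling
```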

Step 4 - Launch the SM UI using the sqlmesh ui command. Since the project was very simple and did not have many dbt plugins/macros, no error popped up.
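
Run from the project root:

```bash
sqlmesh ui   # starts the local web UI; open the address it prints in a browser
```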

This is how the UI looks the first time you open it:

Step 5 - Jumping to the virtual environment feature: I created a new environment dev and simply applied a filter condition in stg_orders. This is a breaking change; let's look at the plan and apply it to the dev environment.
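
For illustration, the edit was of this shape; the exact filter is not important, the columns follow the classic jaffle_shop staging model, and the WHERE clause here is only illustrative:

```sql
-- stg_orders with an illustrative filter added (a breaking change, since
-- downstream models now see fewer rows).
select
    id as order_id,
    user_id as customer_id,
    order_date,
    status
from {{ ref('raw_orders') }}
where status != 'returned'   -- the newly added filter condition
```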

While editing the model I was looking at the lineage displayed below it. Wow, it has column-level lineage.

Lineage graph
When we click on Plan, the plan UI comes up and we can apply the change.

Step 6 - We can directly select prod as the environment from the plan dropdown and apply it again. That's it, we have promoted the change to prod.
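
The same promotion can be done from the CLI; since the backfill was already computed for dev, applying the plan to prod should be mostly a virtual update (repointing views) rather than a recompute:

```bash
sqlmesh plan prod   # promote the reviewed change to prod; prompts before applying
```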

Conclusion

Though it looks very promising, for now I would stick with dbt, mainly because of the community support and the amount of tooling around it. But maybe in the future, when SM is in a more mature state, I would be happy to work on a migration project :D

