Unleash the power of SQLMesh
In the current data engineering space we have many trending tools (duckdb, dbt, airbyte, polars, datafusion, etc.), mainly because they are easy to implement, open source, and community supported. dbt is one such tool; it helps with data transformation using templating, but dbt is not the first tool to implement this idea. Informatica did something similar for transformations back in the day. What makes dbt stand out is that it is a complete package covering a CLI, tests, DRY promotion, and lineage.
Like every other tool, dbt also comes with some shortcomings:
- Heavy use of Jinja templating can hide the original SQL
- Managing state, and the change-aware dependency handling that comes with it, is not possible in dbt
I see a trend in the data engineering ecosystem where you build something open source and then build a paid version on top of it. You also mingle well with your brothers and sisters (other tools) in the data engineering space.
I will refer to SQLMesh as SM for the rest of this article.
Understanding how SM works
Create models - Works similarly to dbt models; the logic can be written in SQL or Python.
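To make that concrete, here is a minimal sketch of a Python model using SM's @model decorator; the model, table, and column names are all invented for illustration:

```python
from datetime import datetime

import pandas as pd
from sqlmesh import ExecutionContext, model


# Hypothetical staging model; names and columns are made up for this sketch
@model(
    "demo.stg_orders_py",
    columns={"order_id": "int", "order_date": "date"},
)
def execute(
    context: ExecutionContext,
    start: datetime,
    end: datetime,
    execution_time: datetime,
    **kwargs,
) -> pd.DataFrame:
    # fetchdf runs the query on the connected engine and returns a DataFrame
    return context.fetchdf("SELECT order_id, order_date FROM demo.raw_orders")
```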
Plans - This is where it starts getting interesting. When we intend to make a change to a model, it can have two effects:
- the logical implications of the change
- the computational requirements of the change
The plan command identifies all the direct and indirect changes that the code change is causing, and it can be run per environment. In some projects, the developer makes code changes at lightning speed but forgets to check the computational impact; the code is deployed to prod, and only then does the developer realize the runtime has increased. SM addresses this.
Apply - the apply step applies the change shown in the plan. The computation performed is based on the change we introduced.
For example:
- If the change is a column addition on a model and that column is not used anywhere else, then only that model is affected, and the backfill is done only for that model.
- If the change is a breaking one, like a calculation logic change on a column that is referenced in multiple models, then we will have to backfill all the impacted models.
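If you prefer the Python API over the CLI, the plan/apply flow looks roughly like this (a sketch, assuming a SM project sits in the current directory):

```python
from sqlmesh import Context

# Load the SQLMesh project from the current directory (path is an assumption)
context = Context(paths=".")

# Build a plan against the dev environment: SM works out the direct and
# indirect impact of the code change and proposes the required backfills
plan = context.plan("dev")

# Apply the plan: only the models impacted by the change are recomputed
context.apply(plan)
```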
This is super clean, right? In the Terraform world, once a plan is generated it is taken to a meeting and approved by everyone based on that plan, so everyone is on the same page. SM brings the same workflow to the data team.
Virtualenv - Data engineers always complain about non-prod data not matching production and want a copy of production data to test effectively. SM addresses this problem with virtual data environments. What this means is that we don't copy the data to the non-prod environment; instead, SM creates a virtual layer of views pointing at the prod data. So the data you use to test is actually the prod data. How cool is that?
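Concretely: once a plan has been applied to dev, the dev environment is just a set of views reading the same physical data prod uses. A rough sketch; the demo__dev schema name is an assumption based on SM's environment-suffix convention:

```python
from sqlmesh import Context

context = Context(paths=".")

# Non-prod environments are exposed as views in suffixed schemas (e.g. demo__dev);
# until a change forces a backfill, these views point at the same physical
# tables that prod uses, so we test against prod data without copying it
df = context.fetchdf("SELECT * FROM demo__dev.stg_orders LIMIT 5")
print(df)
```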
SM also has a way to define tests and audits. There was also a mention of using duckdb for local testing and Snowflake for the prod environment. This is possible because SM is built on SQLGlot, a SQL transpiler, so dialect-to-dialect SQL conversion is possible: https://github.com/tobymao/sqlglot
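SQLGlot can also be used on its own; a quick sketch of the dialect-to-dialect conversion mentioned above:

```python
import sqlglot

# Transpile a duckdb-flavored query into the Snowflake dialect
converted = sqlglot.transpile(
    "SELECT EPOCH_MS(1618088028295)",
    read="duckdb",
    write="snowflake",
)[0]
print(converted)
```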
Some of the features which I think are very good:
- State is managed for you
- Column-level and table lineage without us using ref functions. The lineage graph also shows the corresponding CTE responsible for a column's transformation (see the SQLGlot lineage sketch after this list)
- A production-like environment for dev, built from views on top of prod data
- Comments are converted into table and column docs, which is cool; no need to write painful YAML files
- Changes are segregated into breaking and non-breaking, with a proposal for the backfill. We can also easily understand what kind of computational change we are introducing into the system
- Unit tests can be defined down to the CTE level
- There is an adapter that works with existing dbt projects, though some modifications are needed for Jinja code (macros)
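Since SM's lineage is powered by SQLGlot, you can poke at column-level lineage directly; a toy sketch tracing a column back through a CTE (the SQL and all names here are invented):

```python
from sqlglot.lineage import lineage

sql = """
WITH orders_cte AS (
    SELECT order_id, amount * 1.1 AS amount_with_tax FROM raw_orders
)
SELECT amount_with_tax FROM orders_cte
"""

# Build the lineage graph for one output column and walk its ancestry,
# which includes the CTE that produced the transformation
node = lineage("amount_with_tax", sql)
for ancestor in node.walk():
    print(ancestor.name)
```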
Other fancy features
- SM has a CLI, notebook support, Python APIs, and a good-looking UI
- SM maintains its own metadata in the warehouse
- Airflow integration
- CI/CD integration
I will be trying out SM's dbt adapter. Let's see how good the compatibility is.
Step 1 - Clone the repo below and follow the instructions in its GitHub README
Step 2 - Install SM with the UI option (pip install "sqlmesh[web]"). SM's install instructions are clear
Step 3 - For the dbt project to work with SM, we have to add a start date in dbt_project.yml; this is used for the backfills
Step 4 - Launch the SM UI using the sqlmesh ui command. Since the project was very simple and did not use many dbt plugins/macros, no errors popped up
Step 5 - Jumping to the virtual environment feature: I created a new environment, dev, and simply applied a filter condition in stg_orders. This is a breaking change; let's look at the plan and apply it on the dev environment
While editing the model, I was looking at the lineage displayed below the editor. Wow, it had column-level lineage
Step 6 - We can select prod directly from the plan's environment dropdown and apply again. That's it; we have promoted the change to prod
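In Python API terms, this promotion is just another plan, this time targeting prod; since dev already computed the data, applying it is a "virtual update" that repoints the prod views instead of recomputing (a sketch, same assumptions as before):

```python
from sqlmesh import Context

context = Context(paths=".")

# Promote the change that dev already computed: applying the plan to prod
# is a virtual update - views are repointed, nothing is recomputed
plan = context.plan("prod")
context.apply(plan)
```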
Conclusion
Though it looks very promising, for now I would stick with dbt, mainly because of its community support and the tooling around it. But maybe in the future, when SM is in a more matured state, I would be happy to work on a migration project :D