登录查看更多内容

SQLMesh: The future of DataOps

Tobias (Toby) Mao

Co-Founder and CTO @ Tobiko Data

发布日期: 2023年3月28日

Hi, I’m Toby, cofounder and CTO of Tobiko .?

We are excited to share SQLMesh, an open-source DataOps framework that brings the benefits of DevOps to data teams. It enables data scientists, analysts, and engineers to efficiently run and deploy data transformations written in SQL or Python.?

Our team’s experiences at Airbnb, Apple, Google, and Netflix led us to build SQLMesh, a framework that goes beyond simply transforming data.?

Why do we care?

Once companies realized that “data is the new oil,” they raced to collect and manage tons of it. The rush gave birth to the “Big Data” movement, which led to the creation of fast engines and optimized warehouses that promised to magically make data valuable.?

Despite all the advanced tools, more often than not you’ll find a big oil spill in the middle of it all: duplicated tables, mysteriously interlinked pipelines, and business logic obfuscated in Jinjafied SQL, all held together by duct-taped Airflow DAGs.?

While data teams keep everything running, they are slowed down by fear of change because any tweak could cause the house of cards to come crashing down.

Struggling with change

Here are some challenges that we’ve faced, which I’m sure many data teams will find familiar. These pain points become more acute when data sizes increase or the number of data users expands:

Data pipelines are fragmented and fragile: At Netflix, I remember receiving a Slack message from an exec asking why their dashboard looked off. I was confused because I hadn’t made any recent changes to my jobs. After hours of digging, I finally found an upstream table that had been modified by someone else. They weren’t aware my pipeline depended on the table so didn’t tell me about the change.

Data quality checks are not sufficient: Over the weekend, I studied the upstream code and eventually came up with a fix. However, I didn’t own the pipeline and wasn’t familiar with their deployment process or the possible unintended consequences my “fix” might have on other pipelines. Because the pipeline only had data quality checks, the only way to validate my changes was by backfilling them, which would cost both time and money.

It’s too hard and too costly to build staging environments for data: Additionally, I didn’t have an easy way to deploy a staging environment so I could manually validate and verify the changes. Testing my change against the entire graph of dependencies would have taken massive coordination with many teams, which wasn’t feasible.

Silos transform data lakes to data swamps: Instead of risking the other team’s pipeline, I gave up trying to fix it and made another version of the table with just the changes that I needed. This was much easier for me to get into production because I could do it myself without involving other teams.?

Adam Kawa 6 个月前

?? DATA Pill #102 - 50 Years of SQL, dbt + Airflow = ?

Adam Kawa 5 个月前

ProntoPro’s Data team - Gaining insights into the…

ProntoPro 2 年前

Sound familiar? Software developers had similar struggles before the advent of DevOps. With the DevOps revolution, the entire paradigm of how to build and ship software was reformed.?

As DevOps gained traction, unit tests, continuous integration, and continuous deployment became a standard part of software development.?

It’s time for data teams to have the same confidence as their peers. It’s time for DataOps.

No alt text provided for this image — DataOps = Data + DevOps

Introducing SQLMesh

We’ve built SQLMesh guided by three core principles:

Correctness is non-negotiable: Bad data is worse than no data. SQLMesh guarantees that your data will be consistent even in heavily collaborative environments.

Change with confidence: SQLMesh summarizes the impact of changes and provides automated guardrails empowering everyone to safely and quickly contribute.??

Efficiency without complexity: SQLMesh automatically optimizes your workloads by reusing tables and minimizing computation saving you time and money.

Here are some key features:

Efficient dev / staging environments: SQLMesh builds a “Virtual Data Warehouse” using views, which allows you to seamlessly rollback or roll forward your changes! Any data computation you run for validation purposes is not wasted. With a cheap pointer swap you re-use your “staging” data in production. This means you get unlimited copy-on-write environments that make data exploration and preview of changes easy and safe.?
Automatic DAG generation by semantically parsing and understanding SQL or Python scripts: No need to manually tag dependencies—SQLMesh was built with the ability to understand your entire data warehouse’s dependency graph.
Informative change summaries: Before making changes, SQLMesh will figure out what has changed and show the entire graph of affected jobs.
CI-Runnable Unit and Integration tests: Pipeline unit tests can be defined in YAML and run on every commit. SQLMesh can optionally convert your queries to DuckDB so that your tests can be self-contained.?
Smart change categorization: Column level lineage automatically determines whether changes are “breaking” or “non-breaking”, allowing you to correctly categorize changes and skip expensive backfills.
Easy incremental loads: Loading tables incrementally is as easy as a full refresh. SQLMesh transparently handles the complexity of tracking which intervals need loading, so all you have to do is specify a date filter.
Integrated with Airflow: You can schedule jobs with our simple built-in scheduler or use your existing Airflow cluster. SQLMesh can dynamically generate and push Airflow DAGs. We aim to support other schedulers like Dagster and Prefect in the future.
Notebook / CLI: Interact with SQLMesh with whatever tool you’re comfortable with.
Web based IDE (in preview): Edit, run, and visualize queries in your browser
Github CI/CD bot (in preview): A bot to tie your code directly to your data
DBT compatibility (in preview): Import and run an existing DBT project in SQLMesh’s runtime environment.

We’re just getting started on our journey to change the way data pipelines are built and deployed. We’re open sourcing as early as possible so that we can grow together with the community. Try out SQLMesh with a simple pip install by following the quick start guide. We’d love to chat and hear about your experiences and ideas in our Slack community.