Lakehouse as Code | 04. Delta Live Tables Data Pipelines
Olivier Soucy
Founder @ okube.ai | Fractional Data Platform Engineer | Open-source Developer | Databricks Partner
Welcome to the Lakehouse as Code mini-series! In this series, we'll walk you through deploying a complete Databricks lakehouse using infrastructure as code with Laktory. From setting up Unity Catalog to orchestrating data pipelines and configuring your workspace, we’ve got everything covered.
Prefer to watch and learn? Check out the video on the same topic here: https://youtu.be/cX3EPV_xWrM
You can also access the source code here: https://github.com/okube-ai/okube-samples/tree/main/008-lac-dlt
Previous Parts:
Preface
Databricks has gained significant traction in recent years, with more businesses adopting it as their main data platform for production workloads. When moving beyond ad hoc exploration and experimentation, it's essential to adopt DevOps best practices to manage your data ecosystem—a discipline known as DataOps. One key component of this is infrastructure as code (IaC).
Currently, a common approach is for DevOps or IT teams to use Terraform to set up "core resources" like the metastore, catalogs, and access grants, while Data Engineering teams manage notebooks and jobs using Databricks Asset Bundles (DABs). Often, dbt is also integrated to support data transformations. However, spreading ownership across multiple teams can create friction and inefficiencies—not to mention the steep learning curve for mastering all these technologies.
What if there was a simpler way? Enter Laktory—an all-in-one solution that allows you to define both data transformations and Databricks resources. Think of Laktory as the best of Terraform, Databricks Asset Bundles, and dbt combined into one streamlined tool.
Now, let’s dive in and see how Laktory can help you operate a data platform more efficiently.
Big Picture
The diagram below outlines some key components of a Lakehouse. Throughout this series, we’ll guide you through configuring and deploying each one of them.
Part 4 - Delta Live Tables Data Pipeline
In this fourth part, we focus on configuring and deploying a Delta Live Tables data pipeline. You'll learn how to declare pipeline nodes and transformations as code, deploy them with the Laktory CLI, and execute the pipeline both remotely on Databricks and locally from your IDE.
Pre-requisites
Before you begin, make sure you have the following:
Get Started
The quickest way to configure your Laktory Stack for this part is by running the quickstart command through the Laktory CLI with the workflows template. This command will generate a fully functional stack that you can deploy using the laktory deploy command.
If you're using a Terraform backend, be sure to initialize the stack with the laktory init command as well.
Stack
The default stack covers several elements, including dbfsfiles, notebooks, jobs, and pipelines.
Since we've already discussed DBFS files, jobs, and pipeline jobs in previous posts, this article will focus on configuring a Delta Live Tables (DLT) pipeline, so we’ll comment out the other elements.
Pipeline
In Laktory, pipeline models are a specialized way to declare data pipelines. Beyond specifying the cloud or Databricks resources, these models also let you define the expected data assets and the transformations applied to them.
Orchestrator
The orchestrator is where you define the compute resources that will execute the pipeline remotely. It determines which resources are deployed to Databricks. In this example, we’ll be using Delta Live Tables, though Databricks Jobs are also supported.
The key advantage of this setup is that you don’t need to manually create notebooks defining your tables and views. Instead, you can use the generic notebook provided with the stack, which will dynamically read the nodes defined in your configuration.
Nodes
Nodes represent the data assets (views or tables) you want to create. Each node produces a DataFrame by reading data from a source, applying transformations, and optionally writing the output to a sink. This fits seamlessly with the Delta Live Tables (DLT) concept.
When using DLT as the orchestrator, Laktory leverages dlt.read and dlt.read_stream methods, allowing DLT to manage tables and resolve dependencies automatically. For sink operations, DLT's internal mechanisms are used, and if no sink is defined, the node will result in a DLT view.
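Conceptually, a node deployed through the DLT orchestrator boils down to the familiar decorator pattern. The sketch below is illustrative only, not the exact code Laktory generates, and the view name is hypothetical:

```python
import dlt

# A node with a sink is materialized as a DLT (streaming) table
@dlt.table(name="slv_stock_prices")
def slv_stock_prices():
    # dlt.read_stream lets DLT resolve the upstream node and build the dependency graph
    return dlt.read_stream("brz_stock_prices")

# A node without a sink is exposed as a DLT view instead
@dlt.view(name="stock_prices_tmp")  # hypothetical node name
def stock_prices_tmp():
    return dlt.read("brz_stock_prices")
```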
Streaming
In this pipeline, all node sources have the as_stream property enabled, which results in generating a Spark Structured Streaming DataFrame. This enables highly cost-efficient and time-effective incremental or merge operations.
Behind the scenes, enabling this option leverages technologies like Databricks Autoloader, spark.readStream, and dlt.read_stream to produce the desired output. When a node's DataFrame is read as a Spark Structured Streaming DataFrame, it will also be written as a stream.
This configuration allows for seamless switching between batch and streaming modes, providing an abstraction layer over the underlying technology. Whether you’re reading disk files, data tables, or another node, the pipeline operates efficiently, offering flexibility depending on the data source.
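In Spark terms, the as_stream flag essentially switches between the batch and streaming read APIs. A minimal sketch, assuming a Databricks environment where spark is already defined, with made-up paths and file format:

```python
# Reading a data table
df = spark.read.table("lqs_dev.laktory.brz_stock_prices")        # as_stream: False
df = spark.readStream.table("lqs_dev.laktory.brz_stock_prices")  # as_stream: True

# Reading disk files, where streaming relies on Databricks Autoloader
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")  # assumed file format
    .load("dbfs:/tmp/stock_prices/")      # hypothetical path
)
```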
Bronze Stock Prices
In this example, the bronze table reads data from the DBFS files uploaded to the workspace and writes the output to a table called brz_stock_prices in the laktory schema of the lqs_dev catalog, as specified by the DLT catalog and schema properties. No transformations are applied at this stage.
Silver Stock Prices
The silver table uses the bronze node as its data source, which is equivalent to calling dlt.read_stream("brz_stock_prices"). This automatically creates the pipeline’s DAG and orchestrates tasks without needing explicit definitions. We also introduce the concept of expectations, which set quality targets on specific columns. Currently, this feature is supported only when using the DLT orchestrator, but it will be expanded to support other use cases in the future.
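With the DLT orchestrator, a node expectation translates into DLT's expectation decorators. The sketch below is illustrative only; the expectation name and condition are assumptions:

```python
import dlt

@dlt.table(name="slv_stock_prices")
@dlt.expect("positive_price", "close > 0")            # warn only: rows are kept, the metric is tracked
# @dlt.expect_or_drop("positive_price", "close > 0")  # alternative: drop offending rows
def slv_stock_prices():
    return dlt.read_stream("brz_stock_prices")
```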
Between the bronze and silver stages, we define a series of transformations. Unlike previous examples where SQL statements were used to select columns, here we use Laktory’s built-in with_columns method, which allows for Spark-like declaration of new columns. Each column can be named, typed, and expressed either as an SQL expression or a Spark function, such as F.col("y") * F.sql("x").
The second transformation removes duplicates using the Spark function drop_duplicates, which is equivalent to:
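(A minimal PySpark sketch; the subset of columns is an assumption, since the actual configuration defines it.)

```python
# Hypothetical equivalent of the drop_duplicates transformation node
df = df.drop_duplicates(subset=["symbol", "created_at"])  # subset columns assumed
```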
One of Laktory’s standout features is the ability to chain multiple SQL or Spark DataFrame-based operations from a YAML configuration file. This approach offers flexibility and modularity when defining data transformations.
Execution
Let’s kick off our first deployment by running the laktory deploy command via the CLI.
Once deployed, navigate to the Delta Live Tables section in your Databricks Workspace, where you’ll find the newly created pipeline.
Remote Execution
After execution, you’ll notice that each node successfully results in the creation of a table.
Since the as_stream option was enabled, the tables are configured as streaming tables. If you re-run the pipeline without injecting new data, no rows in the tables will be modified, as demonstrated in subsequent runs.
The notebook supporting this execution is fully generic and can be reused for any pipeline configuration.
It works by reading and instantiating the pipeline configuration (passed as a parameter) and iterating over each node in that pipeline. For each node, it calls a define_table function, which builds get_df, a function that returns the desired DataFrame, and applies the necessary DLT decorators for proper execution. Inside get_df, all that’s required is to call the node’s execute() method.
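In rough pseudocode, the notebook's core loop looks like this; the way the configuration is loaded and the file path are assumptions, while define_table, get_df, and node.execute() are the pieces described above:

```python
import dlt
from laktory import models

# Load the pipeline model from the configuration passed as a parameter
# (path and loading mechanism are assumptions for illustration)
with open("/Workspace/path/to/pl-stocks.json") as fp:
    pl = models.Pipeline.model_validate_json(fp.read())

def define_table(node):
    @dlt.table(name=node.name)
    def get_df():
        # Read the node's source, apply its transformations, return the DataFrame
        return node.execute()
    return get_df

# Register one DLT table (or view) per node in the pipeline
for node in pl.nodes:
    define_table(node)
```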
Laktory adds significant value here by enabling you to run this notebook interactively on a cluster and preview the results—something that is normally impossible with DLT. Below is the resulting output:
Local Execution
Prototyping and debugging DLT pipelines without actually running them is highly useful, but being able to execute them from your favorite IDE takes it to the next level. Laktory makes this possible thanks to its internal ETL capabilities. By executing the scripts/debug_pl.yaml script, you can set up a remote Spark session, load the pipeline from your local configuration, and run it, allowing you to examine and debug the results. This even works for streaming DataFrames. Below is an example output:
While there are some limitations on the types of DataFrames supported and isolated node execution, this capability is highly efficient for rapid iteration and prototyping—something we all wish was built directly into DLT.
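For reference, here is a hypothetical sketch of what such a debug script might look like, assuming Databricks Connect provides the remote Spark session, a locally stored YAML configuration, and an execute method accepting a Spark session:

```python
import yaml
from databricks.connect import DatabricksSession
from laktory import models

# Remote Spark session (assumes Databricks Connect is configured)
spark = DatabricksSession.builder.getOrCreate()

# Load the pipeline from the local YAML configuration (path is an assumption)
with open("resources/pl-stocks.yaml") as fp:
    pl = models.Pipeline.model_validate(yaml.safe_load(fp))

# Execute each node and inspect the results while debugging
for node in pl.nodes:
    df = node.execute(spark=spark)  # kwarg assumed; execute() returns the node's DataFrame
    print(node.name, df.columns)
```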
Adding Pipeline Nodes
While a simple two-node data pipeline is a good starting point, most real-world scenarios require multiple bronze and silver tables, eventually combining them for analytical purposes. To extend our stocks pipeline, we’ll add bronze and silver tables for stock metadata, then join them with stock prices as shown below:
A few key details:
You have full control over how modular you want your pipeline to be. For a given node, all transformations can be wrapped into a single Spark function, or you can break them down into multiple sub-components. This flexibility depends on the level of readability you aim for, your testing strategy, and how often you plan to reuse functions across different nodes.
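As a rough picture of what the new join node computes, here is the conceptual PySpark equivalent; the DataFrame variables and column names are assumptions for illustration:

```python
# Join silver stock prices with silver stock metadata on the ticker symbol
df_joined = df_slv_stock_prices.join(
    df_slv_stock_metadata.select("symbol", "currency", "first_traded"),  # assumed columns
    on="symbol",
    how="left",
)
```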
Once the updated pipeline is deployed to Databricks, we re-run it and observe the expected DAG of tables based on the new nodes we added.
In just a few minutes, we’ve built a sophisticated Delta Live Tables pipeline driven entirely by a simple configuration file.
Summary
We successfully configured, deployed, and executed a Delta Live Tables (DLT) pipeline. Using simple YAML configuration files alongside the Laktory CLI, we defined and deployed data assets, transformations, and the necessary Databricks resources. Laktory’s framework offers highly modular data transformations, seamlessly supporting both SQL and Spark DataFrame operations. Combined with Databricks Delta Live Tables orchestration, it provides robust quality gates and fault tolerance features.
One of Laktory’s standout capabilities is its built-in execution framework, allowing you to develop and test pipelines directly from your favorite IDE—a true game-changer for developers.
I hope you’ve enjoyed the series so far! In the next part, we’ll explore how to reconfigure the same data pipeline for an incremental approach and execute it using Databricks Delta Live Tables.
What about you? Have you ever configured data pipelines as code before? I’d love to hear about your experience—feel free to share your thoughts in the comments below!
And don’t forget to follow me for more insightful content on data and ML engineering.
#dataengineering #dataops #unitycatalog #databricks #laktory #etl #deltalivetables #dlt