Delta Live Tables on Databricks. When should you use them and when should you not?
YASH MAHENDRA JOSHI
Associate Manager@Accenture France | Data Engineering@Michelin Manufacturing | Professor for Spark, Big Data & Machine Learning
Delta Live Tables (DLTs) with #Databricks are great to use but come with their own constraints. I have had the pleasure of working on two projects that implemented DLTs, first as a newbie to the subject, and again later this year with considerably more baggage and experience. I will try to summarize some insights in this short article.
DLTs are a declarative ETL framework for building reliable end-to-end data pipelines: they add an abstraction layer over Spark (Structured Streaming for incremental stateful processing, for example) for better performance, management and visibility. You can access them from the Workflows tab in Databricks, and schedule them with Jobs (either from the UI or the Jobs API). But wait, first things first: what's the difference between a Delta table and a Delta Live Table?
A Delta table is a way to store data in tables, whereas Delta Live Tables allows you to describe how data flows between these tables declaratively. Delta Live Tables is a declarative framework that manages many Delta tables, by creating them and keeping them up to date. In short, Delta tables are a data format, while Delta Live Tables is a data pipeline framework (source: https://learn.microsoft.com/en-us/azure/databricks/introduction/delta-comparison).
And why should we bother with DLTs? :) Let's look at the advantages.
Pipeline Visibility: The first simple reason is that you get a Directed Acyclic Graph (DAG) of your full data pipeline workloads -> this is super useful when you have complex pipelines, because you can visually follow the data moving from the source tables in your bronze layer to the final aggregated tables in your gold layer, along with the number of rows written per table. Today, Unity Catalog also gives you this kind of traceability (Data Lineage), but for projects that don't have UC enabled yet, DLTs offer similar visibility by using automated run-time lineage to capture the dependencies between your tables and views. This helps across sectors and use cases: retail (real-time telemetry), manufacturing (industrial IoT), and supply chain (forecasting), where a real-time view of your data flow brings visibility to data engineers, analysts and scientists.
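To make the DAG idea concrete, here is a minimal sketch of a declarative pipeline in Python. The table names, landing path and column names are hypothetical placeholders, not from a real project; the point is that each function declares a table and DLT infers the dependencies between them to draw the graph.

import dlt
from pyspark.sql import functions as F

# Bronze: ingest raw files incrementally with Auto Loader (hypothetical landing path).
@dlt.table(comment="Raw sales events landed as-is")
def sales_bronze():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/landing/sales/"))

# Silver: light cleaning; the dlt.read_stream() call is what creates the edge in the DAG.
@dlt.table(comment="Typed and filtered sales")
def sales_silver():
    return (dlt.read_stream("sales_bronze")
            .withColumn("amount", F.col("amount").cast("double"))
            .where(F.col("order_id").isNotNull()))

# Gold: aggregation; again the dependency is inferred, not configured by hand.
@dlt.table(comment="Daily revenue per store")
def sales_gold():
    return (dlt.read("sales_silver")
            .groupBy("store_id", "order_date")
            .agg(F.sum("amount").alias("daily_revenue")))

In the pipeline UI this renders as sales_bronze -> sales_silver -> sales_gold, with the row counts written to each table shown on every run.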
CDC and Performance: A second REAL reason why you should use DLTs is Change Data Capture at speed. This is useful when you have to handle many merges, updates and deletes (CRUD operations) in your data flow from source to final tables. What I really liked is the support for Slowly Changing Dimension Type 1 and Type 2 -> this means that in your data lakehouse (where compute and storage are separate, unlike monolithic Hadoop clusters), you can retain a history of records, either on all updates or on updates to a specified set of columns, just by setting the enableTrackHistory parameter in the configuration. And DLTs are in fact very, very fast -> they are auto-compacted and auto-optimized by default using the Delta format. Auto compaction simply means compacting small files within Delta table partitions to automatically reduce the small-file problem. To give you a real example, a typical merge that took 35 minutes for data from one table took just 6 minutes with Delta Live Tables. On another project I used the Z-Order technique on the primary key (without DLTs) to reduce merge time, but that required performance-tuning knowledge which not all developers have. However, the downside is that you have to learn a new syntax to express merges in DLTs (APPLY CHANGES INTO rather than MERGE INTO), so there is a learning curve involved which takes time; a sketch of this syntax follows the configuration snippet below.
{
  "configuration": {
    "pipelines.enableTrackHistory": "true"
  }
}
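To illustrate that new syntax, here is a minimal Python sketch of APPLY CHANGES (the SQL form is APPLY CHANGES INTO). The source table, key and column names are hypothetical; stored_as_scd_type=2 is what produces the SCD Type 2 history mentioned above.

import dlt
from pyspark.sql.functions import col, expr

# Declare the target streaming table that APPLY CHANGES will create and maintain.
dlt.create_streaming_table("customers_silver")

# Apply inserts, updates and deletes coming from a CDC feed (hypothetical source and columns).
dlt.apply_changes(
    target="customers_silver",
    source="customers_cdc_bronze",                   # upstream table carrying the change feed
    keys=["customer_id"],                            # business key used to match records
    sequence_by=col("event_timestamp"),              # ordering column to resolve out-of-order events
    apply_as_deletes=expr("operation = 'DELETE'"),   # rows flagged as deletes in the feed
    stored_as_scd_type=2                             # keep full history (SCD Type 2)
)

Tracking history only on a chosen subset of columns can then be added with the track_history_column_list argument, in combination with the enableTrackHistory setting shown above.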
Simplicity: Generally, I find DLTs written in SQL simpler to create and to debug than DLTs written in Python, especially if other data engineers or business analysts want to read your code during unit or functional testing. When developing DLTs in Python, the @dlt.table decorator is used to create a Delta Live Table. In SQL, you just add the 'LIVE' keyword (or 'STREAMING LIVE') and basta -> CREATE LIVE TABLE X AS SELECT * FROM db.Table, and it works for your materialised views.
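For comparison, here is a minimal sketch of the Python equivalent of that one-line SQL statement (db.Table is kept as the hypothetical source):

import dlt

# Python equivalent of: CREATE LIVE TABLE X AS SELECT * FROM db.Table
@dlt.table(name="X")
def x():
    return spark.read.table("db.Table")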
Better and cheaper Streaming capabilities: A third reason is that DLTs add an abstraction layer on top of Spark Structured Streaming for usability and performance, and Databricks in general is excellent for streaming use cases using micro-batching. We can use DLTs in both Kappa and Lambda architectures, which offers flexibility to architects and data engineers. You read your raw data from Kafka topics into your bronze layer (there exist, of course, other types of event-driven architectures; this is just an example) as well as from database systems, apply incremental updates and joins in your silver layer, and aggregate in your gold layer. And this can be put to SCALE with DLTs easily. In core mode, with a job cluster, DLTs can be significantly cheaper for typical streaming jobs that use an interactive cluster 24x7.
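As an illustration, here is a minimal sketch of a streaming bronze table fed from a Kafka topic; the broker address and topic name are hypothetical placeholders.

import dlt

@dlt.table(comment="Raw events streamed from Kafka into the bronze layer")
def events_bronze():
    return (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
            .option("subscribe", "telemetry-events")             # hypothetical topic
            .option("startingOffsets", "earliest")
            .load()
            .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING) AS payload", "timestamp"))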
ACID (and not alkaline :p) transactions: A point I often mention to my students at university is ACID transactions: because DLTs use the Delta format, you get Atomicity, Consistency, Isolation and Durability in a unified processing framework, which is needed in industrial solutions where the quality of data and the exactness of the transformation logic matter. Hence it's also important to mention the notion of data quality expectations: I have been an advocate of data quality at source, as well as of data quality tracking within pipelines, for the last few years, and the Great Expectations library has helped me a lot to implement DQ for a number of BI projects. Do we have this feature in DLTs?
And all that is green is not gold... not even DLTs.
DLTs have expectations built in to make this simpler for the developer: but be warned, this comes at almost double the cost!! You are obliged to choose the Advanced mode in order to get built-in expectations. On Azure with a premium subscription, DLTs cost $0.30 per DBU, but can cost almost double, at $0.54 per DBU, if you also enable expectations and CDC in Advanced mode. If you need it for your use case, go ahead, but keep the additional bill in mind!!
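For reference, here is a minimal sketch of what expectations look like in Python; the table, constraint names and conditions are hypothetical. @dlt.expect only records violations in the pipeline metrics, while @dlt.expect_or_drop also removes the offending rows.

import dlt

@dlt.table(comment="Orders that passed basic quality checks")
@dlt.expect("has_customer_id", "customer_id IS NOT NULL")    # violations are tracked but rows are kept
@dlt.expect_or_drop("non_negative_amount", "amount >= 0")    # violating rows are dropped
def orders_clean():
    return dlt.read("orders_bronze")                         # hypothetical upstream table

A stricter variant, @dlt.expect_or_fail, stops the update entirely when the constraint is violated.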
Also, from a multi-cloud perspective, DLTs are significantly cheaper on AWS than on Azure (I imagine this is because of the number of Databricks servers hosted on AWS), and Google Cloud doesn't have all the options yet. It would be nice to see balanced pricing across all three cloud providers, as well as the possibility of using DLTs with Databricks on Oracle Cloud or OVH.
Limits to CPU: So far we can say that DLTs offer significant advantages but also some disadvantages. It's important to consider the scalability challenges and the CPU consumption associated with large-scale deployments. Typical Azure subscriptions have a limit on available CPUs: whenever a new VM is deployed, its vCPUs must not exceed either the vCPU quota for the VM size family or the total regional vCPU quota. DLTs scale up quickly, but they consume a lot of CPU: even the maintenance pipelines scale up to 30 vCPUs for a few seconds to do their job, and this is not sustainable for organisations that either share (mutualise) Databricks workspaces across many projects and/or have strict limits on their available CPU. I hope Databricks addresses this issue in future releases to allow a more widespread adoption of DLTs.
Difficulties in debugging: Additionally, debugging complex workflows involving DLTs requires careful monitoring and robust logging mechanisms. While both are present in DLTs, you are obliged to use the separate pipeline UI, which can be startling to a new developer faced with notions like Full Recompute and Full Refresh, and the inability to debug a DLT notebook on an interactive cluster can become tedious.
Enter dlt-with-debug, a lightweight decorator utility which allows developers to do interactive pipeline development by keeping a single source code for both the DLT run and a non-DLT interactive notebook run. Here is the GitHub repo with more details: https://github.com/souvik-databricks/dlt-with-debug. While this is a workaround, it is again not (yet) fully supported by Databricks, and it would be nice to see an official option to debug DLTs from an interactive notebook.
To summarize: an amazing innovation, but not for all use cases. The following table summarizes the information:
LinkedIn, why can't we insert tables directly?
Thanks for reading till the end, please like, comment and share if you enjoyed reading this article. Here are some references for further reading on #databricks and #deltalivetables
References:
https://learn.microsoft.com/en-us/azure/databricks/introduction/delta-comparison
Comments:

Azure Big Data Engineer · 9 months ago: great, thanks man, useful and practical explanation

Lead Data Engineer · 10 months ago: Thank you, that was an interesting read. You brought up some points that I didn't have on my radar (yet) - very helpful! I am currently trying to evaluate different approaches for a streaming-heavy project and honestly the sheer number of options (which are on the table right now) is overwhelming. I have an additional question though: how would you rate the overall maturity of DLT as a software product? Were there still "teething problems" that you ran into?

Software Engineer at Compass · 1 year ago: YASH MAHENDRA JOSHI thanks for the post, do you have sources available to support this point: > In core mode, with a job cluster, DLTs can be significantly cheaper for typical streaming jobs that use an interactive cluster 24x7. From reading this post, DLT seems like the right choice for my case, however building side-by-side proofs of concept I have found that DLT is resource-intensive and extremely opaque. My main bone to pick is that it runs hot on a 5-node auto-scaled cluster, whereas a single node using a Spark-native implementation runs fine (and is more performant!). I spent a considerable amount of time trying to look into this, however I found the issue extremely difficult to debug with DLT. IMO there's a "black box" problem that needs to be addressed before it's production-ready for many use cases. I also have to call attention to the data quality checks: the real advantage here is that failed records should be emitted to an audit log, however I found the audit log to be extremely opaque and poorly documented. So we're 2x'ing cost for... what exactly? It's unclear to me.