Delta Live Tables on Databricks. When should you use them and when should you not?
YASH MAHENDRA JOSHI
Associate Manager@Accenture France | Data Engineering@Michelin Manufacturing | Professor for Spark, Big Data & Machine Learning
Delta Live Tables (DLTs) with #Databricks are great to use but come with their own constraints. I have had the pleasure of working on two projects that implemented DLTs, first as a newbie to the subject, and again later this year with considerably more baggage and experience. I will try to summarize some insights in this short article.
DLTs are a declarative ETL framework for building reliable end-to-end data pipelines: they add an abstraction layer over Spark (Structured Streaming for incremental stateful processing, for example) for better performance, management and visibility. You can access them from the Workflows tab in Databricks, and schedule them with Jobs (either from the UI or the Jobs API). But wait, first things first: what's the difference between a Delta table and a Delta Live Table?
A Delta table is a way to store data in tables, whereas Delta Live Tables allows you to describe how data flows between these tables declaratively. Delta Live Tables is a declarative framework that manages many Delta tables, by creating them and keeping them up to date. In short, Delta tables are a data format, while Delta Live Tables is a data pipeline framework (source: https://learn.microsoft.com/en-us/azure/databricks/introduction/delta-comparison).
And why should we bother with DLTs? :) Let's look at the advantages.
Pipeline Visibility: The first simple reason is that you get a Directed Acyclic Graph (DAG) of your full data pipeline workloads -> this is super useful when you have complex pipelines, because you can visually follow the data moving from the source tables in your bronze layer to the final aggregated tables in your gold layer, along with the number of rows written per table. Today, Unity Catalog also gives you this kind of traceability (Data Lineage), but for projects that don't have UC enabled yet, DLTs offer similar visibility by using automated run-time lineage to capture the dependencies between your tables and views. This helps across sectors and use cases: retail (real-time telemetry), manufacturing (industrial IoT), and supply chain (forecasting), where a real-time view of your data flow brings visibility to data engineers, analysts and scientists.
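To make the DAG idea concrete, here is a minimal sketch of a declarative pipeline in Python. The table names, landing path and column names are hypothetical placeholders, not from a real project; the point is that each function declares a table and DLT infers the dependencies between them to draw the graph.

import dlt
from pyspark.sql import functions as F

# Bronze: ingest raw files incrementally with Auto Loader (hypothetical landing path).
@dlt.table(comment="Raw sales events landed as-is")
def sales_bronze():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/landing/sales/"))

# Silver: light cleaning; the dlt.read_stream() call is what creates the edge in the DAG.
@dlt.table(comment="Typed and filtered sales")
def sales_silver():
    return (dlt.read_stream("sales_bronze")
            .withColumn("amount", F.col("amount").cast("double"))
            .where(F.col("order_id").isNotNull()))

# Gold: aggregation; again the dependency is inferred, not configured by hand.
@dlt.table(comment="Daily revenue per store")
def sales_gold():
    return (dlt.read("sales_silver")
            .groupBy("store_id", "order_date")
            .agg(F.sum("amount").alias("daily_revenue")))

In the pipeline UI this renders as sales_bronze -> sales_silver -> sales_gold, with the row counts written to each table shown on every run.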
CDC and Performance: A second REAL reason why you should use DLTs is Change Data Capture at speed. This is useful when you have to handle many merges, updates and deletes (CRUD operations) in your data flow from source to final tables. What I really liked is the support for Slowly Changing Dimension Type 1 and Type 2 -> this means that in your data lakehouse (where compute and storage are separate, unlike monolithic Hadoop clusters), you can retain a history of records, either on all updates or on updates to a specified set of columns, just by setting the enableTrackHistory parameter in the configuration. And DLTs are in fact very, very fast -> they are auto-compacted and auto-optimized by default using the Delta format. Auto compaction simply means compacting small files within Delta table partitions to automatically reduce the small-file problem. To give you a real example, a typical merge that took 35 minutes for data from one table took just 6 minutes with Delta Live Tables. On another project I used the Z-Order technique on the primary key (without DLTs) to reduce merge time, but that required performance-tuning knowledge which not all developers have. However, the downside is that you have to learn a new syntax to express merges in DLTs (APPLY CHANGES INTO rather than MERGE INTO), so there is a learning curve involved which takes time; a sketch of this syntax follows the configuration snippet below.
{
  "configuration": {
    "pipelines.enableTrackHistory": "true"
  }
}
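To illustrate that new syntax, here is a minimal Python sketch of APPLY CHANGES (the SQL form is APPLY CHANGES INTO). The source table, key and column names are hypothetical; stored_as_scd_type=2 is what produces the SCD Type 2 history mentioned above.

import dlt
from pyspark.sql.functions import col, expr

# Declare the target streaming table that APPLY CHANGES will create and maintain.
dlt.create_streaming_table("customers_silver")

# Apply inserts, updates and deletes coming from a CDC feed (hypothetical source and columns).
dlt.apply_changes(
    target="customers_silver",
    source="customers_cdc_bronze",                   # upstream table carrying the change feed
    keys=["customer_id"],                            # business key used to match records
    sequence_by=col("event_timestamp"),              # ordering column to resolve out-of-order events
    apply_as_deletes=expr("operation = 'DELETE'"),   # rows flagged as deletes in the feed
    stored_as_scd_type=2                             # keep full history (SCD Type 2)
)

Tracking history only on a chosen subset of columns can then be added with the track_history_column_list argument, in combination with the enableTrackHistory setting shown above.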
Simplicity: Generally, I find DLTs written in SQL simpler to create and to debug than DLTs written in Python, especially if other data engineers or business analysts want to read your code during unit or functional testing. When developing DLTs in Python, the @dlt.table decorator is used to create a Delta Live Table. In SQL, you just add the 'LIVE' keyword (or 'STREAMING LIVE') and basta -> CREATE LIVE TABLE X AS SELECT * FROM db.Table, and it works for your materialised views.
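For comparison, here is a minimal sketch of the Python equivalent of that one-line SQL statement (db.Table is kept as the hypothetical source):

import dlt

# Python equivalent of: CREATE LIVE TABLE X AS SELECT * FROM db.Table
@dlt.table(name="X")
def x():
    return spark.read.table("db.Table")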
Better and cheaper Streaming capabilities: A third reason is that DLTs add an abstraction layer on top of Spark Structured Streaming for usability and performance, and Databricks in general is excellent for streaming use cases using micro-batching. We can use DLTs in both Kappa and Lambda architectures, which offers flexibility to architects and data engineers. You read your raw data from Kafka topics into your bronze layer (there exist, of course, other types of event-driven architectures; this is just an example) as well as from database systems, apply incremental updates and joins in your silver layer, and aggregate in your gold layer. And this can be put to SCALE with DLTs easily. In core mode, with a job cluster, DLTs can be significantly cheaper for typical streaming jobs that use an interactive cluster 24x7.
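As an illustration, here is a minimal sketch of a streaming bronze table fed from a Kafka topic; the broker address and topic name are hypothetical placeholders.

import dlt

@dlt.table(comment="Raw events streamed from Kafka into the bronze layer")
def events_bronze():
    return (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
            .option("subscribe", "telemetry-events")             # hypothetical topic
            .option("startingOffsets", "earliest")
            .load()
            .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING) AS payload", "timestamp"))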
ACID (and not alkaline :p) transactions: A point I often mention to my students at university is ACID transactions: because DLTs use the Delta format, you get Atomicity, Consistency, Isolation and Durability in a unified processing framework, which is needed in industrial solutions where the quality of data and the exactness of the transformation logic matter. Hence it's also important to mention the notion of data quality expectations: I have been an advocate of data quality at source, as well as of data quality tracking within pipelines, for the last few years, and the Great Expectations library has helped me a lot to implement DQ for a number of BI projects. Do we have this feature in DLTs?
And all that is green is not gold... not even DLTs.
DLTs have expectations built in to make this simpler for the developer: but be warned, this comes at almost double the cost!! You are obliged to choose the Advanced mode in order to get built-in expectations. On Azure with a premium subscription, DLTs cost $0.30 per DBU, but can cost almost double, at $0.54 per DBU, if you also enable expectations and CDC in Advanced mode. If you need it for your use case, go ahead, but keep the additional bill in mind!!
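For reference, here is a minimal sketch of what expectations look like in Python; the table, constraint names and conditions are hypothetical. @dlt.expect only records violations in the pipeline metrics, while @dlt.expect_or_drop also removes the offending rows.

import dlt

@dlt.table(comment="Orders that passed basic quality checks")
@dlt.expect("has_customer_id", "customer_id IS NOT NULL")    # violations are tracked but rows are kept
@dlt.expect_or_drop("non_negative_amount", "amount >= 0")    # violating rows are dropped
def orders_clean():
    return dlt.read("orders_bronze")                         # hypothetical upstream table

A stricter variant, @dlt.expect_or_fail, stops the update entirely when the constraint is violated.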
Also, from a multi-cloud perspective, DLTs are significantly cheaper on AWS than on Azure (I imagine this is because of the number of Databricks servers hosted on AWS), and Google Cloud doesn't have all the options yet. It would be nice to see balanced pricing across all three cloud providers, as well as the possibility of using DLTs with Databricks on Oracle Cloud or OVH.
Limits to CPU: So far we can say that DLTs offer significant advantages but also some disadvantages. It's important to consider the scalability challenges and the CPU consumption associated with large-scale deployments. Typical Azure subscriptions have a limit on available CPUs: whenever a new VM is deployed, its vCPUs must not exceed either the vCPU quota for the VM size family or the total regional vCPU quota. DLTs scale up quickly, but they consume a lot of CPU: even the maintenance pipelines scale up to 30 vCPUs for a few seconds to do their job, and this is not sustainable for organisations that either share (mutualise) Databricks workspaces across many projects and/or have strict limits on their available CPU. I hope Databricks addresses this issue in future releases to allow a more widespread adoption of DLTs.
Difficulties in debugging: Additionally, debugging complex workflows involving DLTs requires careful monitoring and robust logging mechanisms. While both are present in DLTs, you are obliged to use the separate pipeline UI, which can be startling to a new developer faced with notions like Full Recompute and Full Refresh, and the inability to debug a DLT notebook on an interactive cluster can become tedious.
Enter dlt-with-debug, a lightweight decorator utility which allows developers to do interactive pipeline development by keeping a single source code for both the DLT run and a non-DLT interactive notebook run. Here is the GitHub repo with more details: https://github.com/souvik-databricks/dlt-with-debug. While this is a workaround, it is again not (yet) fully supported by Databricks, and it would be nice to see an official option to debug DLTs from an interactive notebook.
To summarize: an amazing innovation, but not for all use cases. The following table summarizes the information:
LinkedIn, why can't we insert tables directly?
Thanks for reading till the end, please like, comment and share if you enjoyed reading this article. Here are some references for further reading on #databricks and #deltalivetables
References:
https://learn.microsoft.com/en-us/azure/databricks/introduction/delta-comparison
Comments:

Azure Big Data Engineer · 9 months ago: great, thanks man, useful and practical explanation

Lead Data Engineer · 10 months ago: Thank you, that was an interesting read. You brought up some points that I didn't have on my radar (yet) - very helpful! I am currently trying to evaluate different approaches for a streaming-heavy project and honestly the sheer number of options (which are on the table right now) is overwhelming. I have an additional question though: how would you rate the overall maturity of DLT as a software product? Were there still "teething problems" that you ran into?

Software Engineer at Compass · 1 year ago: YASH MAHENDRA JOSHI thanks for the post, do you have sources available to support this point: > In core mode, with a job cluster, DLTs can be significantly cheaper for typical streaming jobs that use an interactive cluster 24x7. From reading this post, DLT seems like the right choice for my case, however building side-by-side proofs of concept I have found that DLT is resource-intensive and extremely opaque. My main bone to pick is that it runs hot on a 5-node auto-scaled cluster, whereas a single node using a Spark-native implementation runs fine (and is more performant!). I spent a considerable amount of time trying to look into this, however I found the issue extremely difficult to debug with DLT. IMO there's a "black box" problem that needs to be addressed before it's production-ready for many use cases. I also have to call attention to the data quality checks: the real advantage here is that failed records should be emitted to an audit log, however I found the audit log to be extremely opaque and poorly documented. So we're 2x'ing cost for... what exactly? It's unclear to me.