Road to Lakehouse - Part 1: Delta Lake data pipeline overview
This is the first part of a series on implementing a Data Lakehouse and its data pipelines with Delta Lake. The source code GitHub repositories are linked at the end of this article. For the code explanation, please take a look at the second part:
Road to Lakehouse - Part 2: Ingest and process data from Kafka with CDC and Delta Lake's CDF
What is a Data Lakehouse, and why?
Lakehouse is a buzzword in the data field nowadays. According to AWS, the lake house architecture is about integrating a data lake, a data warehouse, and purpose-built stores to enable unified governance and easy data movement. According to Databricks, the Lakehouse is an open architecture that combines the best elements of data lakes and data warehouses.
And for me, the Lakehouse is simply a data lake that holds both columnar data files and delta files (transaction logs) in row-based formats.
Delta Lake, Iceberg, and Hudi are three popular data lake table formats that support ACID transactions, schema evolution, upserts, time travel, incremental consumption, etc.
In this post, we will focus on Delta Lake with support from Databricks.
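To make these table-format capabilities a bit more concrete, below is a minimal sketch using the Delta Lake Python API: an upsert with MERGE and a time-travel read. The path, schema, and sample data are placeholders chosen for illustration, not anything from the pipeline described in this series.

```python
# A minimal, illustrative sketch of two Delta Lake capabilities on Spark:
# an upsert via MERGE and a time-travel read.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create a small Delta table to work with (placeholder location and schema).
spark.createDataFrame([("c1", "Alice", "US")], ["id", "name", "country"]) \
    .write.format("delta").mode("overwrite").save("/mnt/delta/customers")

target = DeltaTable.forPath(spark, "/mnt/delta/customers")
updates = spark.createDataFrame(
    [("c1", "Alice", "DE"), ("c2", "Bob", "VN")],
    ["id", "name", "country"],
)

# Upsert: update matching rows and insert new ones in a single ACID transaction.
(
    target.alias("t")
    .merge(updates.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Time travel: read the table as it was at an earlier version.
previous_version = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("/mnt/delta/customers")
)
```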
Data lakehouse features
In comparison with a pure data lake, the Lakehouse provides:
The Lakehouse also has more features compared with a traditional Data Warehouse:
Here are some highlights of Delta Lake:
Data pipeline overview
For demonstration purposes, we will implement the data pipeline below in the Delta Lake.
1. First, we capture changes from the source databases (e.g. MongoDB with Debezium) together with back-end events and publish them to Kafka topics.
2. Once we have the data in Kafka, we use Spark Structured Streaming to consume it and load it into Delta tables in the raw area. Thanks to checkpointing, which stores the Kafka offsets in this case, we can recover from failures with exactly-once fault tolerance. We need to enable the Delta Lake CDF (Change Data Feed) feature on these raw tables in order to serve the further layers; see the first sketch after this list.
3. We use Spark Structured Streaming again to ingest the changes from the raw tables. We then apply transformations such as flattening and exploding nested data, etc., and load the cleansed data into the next area, the refined zone. Remember to add the CDF property to the refined tables as well; see the second sketch after this list.
4. Now we are ready to build data mart tables for the business level by aggregating or joining tables from the refined area. This step still runs in near real time because, once more, we read the changes from the previous layer's tables with Spark Structured Streaming; the third sketch after this list shows one possible approach.
5. All the metadata is stored in the Databricks Data Catalog, and all the above tables can be queried in Databricks SQL Analytics, where we can create SQL endpoints and use a SQL editor. Besides the default catalog, we can use the open-source Amundsen for metadata discovery.
6. Finally, we can build data visualizations with Databricks SQL Analytics Dashboards (formerly Redash) or use BI tools like Power BI, Streamlit, etc.
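As a sketch of step 2, the snippet below shows one way to consume a Kafka topic with Spark Structured Streaming and append it to a raw Delta table with the Change Data Feed enabled. The broker address, topic, table, and checkpoint path are placeholders; this is an illustration, not the exact code from the repositories.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Raw table with CDF enabled so that the refined layer can read its changes.
spark.sql("""
    CREATE TABLE IF NOT EXISTS raw.orders (
        key STRING, value STRING, topic STRING, timestamp TIMESTAMP
    )
    USING DELTA
    TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

kafka_stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder brokers
    .option("subscribe", "orders")                      # placeholder topic
    .option("startingOffsets", "earliest")
    .load()
    .selectExpr("CAST(key AS STRING) AS key",
                "CAST(value AS STRING) AS value",
                "topic", "timestamp")
)

# The checkpoint stores the consumed Kafka offsets, which is what allows the
# stream to restart after a failure without losing or duplicating records.
(
    kafka_stream.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/checkpoints/raw_orders")
    .toTable("raw.orders")
)
```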
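For step 3, here is a hedged sketch of reading the raw table's change feed, flattening and exploding a nested JSON payload, and appending the cleansed rows to a refined table (which should also have CDF enabled in its table properties). The payload schema and all names are assumptions made for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, from_json
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Assumed shape of the Kafka message payload, for illustration only.
payload_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer", StructType([
        StructField("id", StringType()),
        StructField("country", StringType()),
    ])),
    StructField("items", ArrayType(StructType([
        StructField("sku", StringType()),
        StructField("qty", StringType()),
    ]))),
])

# Read only the changes of the raw table through the Change Data Feed.
changes = (
    spark.readStream
    .format("delta")
    .option("readChangeFeed", "true")
    .table("raw.orders")
    .filter(col("_change_type").isin("insert", "update_postimage"))
)

# Flatten the nested struct and explode the items array into one row per item.
refined = (
    changes
    .withColumn("payload", from_json(col("value"), payload_schema))
    .withColumn("item", explode(col("payload.items")))
    .select(
        col("payload.order_id").alias("order_id"),
        col("payload.customer.id").alias("customer_id"),
        col("payload.customer.country").alias("country"),
        col("item.sku").alias("sku"),
        col("item.qty").cast("int").alias("qty"),
    )
)

(
    refined.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/checkpoints/refined_orders")
    .toTable("refined.orders")  # create this table with CDF enabled as well
)
```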
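And for step 4, one possible way to keep a business-level mart up to date is to read the refined table's change feed and upsert the aggregates with foreachBatch and MERGE, as sketched below. The mart table, keys, and metric are assumptions, and a production job would also need to make the merge idempotent against batch replays.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as sum_

spark = SparkSession.builder.getOrCreate()

# Mart table that holds the aggregated business metric (placeholder schema).
spark.sql("""
    CREATE TABLE IF NOT EXISTS mart.orders_by_country (country STRING, total_qty BIGINT)
    USING DELTA
""")

def upsert_to_mart(batch_df, batch_id):
    # Aggregate the micro-batch, then merge the result into the mart by key.
    agg = batch_df.groupBy("country").agg(sum_("qty").alias("total_qty"))
    mart = DeltaTable.forName(spark, "mart.orders_by_country")
    (
        mart.alias("t")
        .merge(agg.alias("s"), "t.country = s.country")
        .whenMatchedUpdate(set={"total_qty": col("t.total_qty") + col("s.total_qty")})
        .whenNotMatchedInsertAll()
        .execute()
    )

(
    spark.readStream
    .format("delta")
    .option("readChangeFeed", "true")
    .table("refined.orders")
    .filter(col("_change_type") == "insert")
    .writeStream
    .foreachBatch(upsert_to_mart)
    .option("checkpointLocation", "/mnt/checkpoints/mart_orders_by_country")
    .start()
)
```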
Conclusion
The Lakehouse provides a comprehensive architecture that takes the benefits of both the Data Lake and the Data Warehouse. With just this straightforward data pipeline, we can serve both analytical and operational purposes. In the next part, we will ingest streaming data from MongoDB with Debezium as well as back-end events and load it into the first zone of the Lakehouse; after that, we will process the changes from the raw area and store the cleansed and flattened data in the refined zone.
Source code
Because we are using Spark Structured Streaming to ingest, process, and load the data, there is a Spark cluster running 24/7. We implement all common libraries related to Kafka ingestion in the delta-lake-library repo; the built Python wheel file is installed on the cluster so these libraries are shared among all Spark jobs as well as notebooks. The second repo is delta-lake-pipeline, which contains both the streaming and batch jobs.
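As a purely hypothetical illustration of this layout (the actual module and function names in the delta-lake-library repo may differ), a shared Kafka reader packaged in the wheel could look roughly like this and then be imported in any pipeline job or notebook on the cluster:

```python
# delta_lake_library/kafka.py (hypothetical module inside the shared wheel)
from pyspark.sql import DataFrame, SparkSession


def read_kafka_topic(spark: SparkSession, brokers: str, topic: str) -> DataFrame:
    """Common Kafka source reader shared by all streaming jobs."""
    return (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", brokers)
        .option("subscribe", topic)
        .load()
        .selectExpr("CAST(value AS STRING) AS value", "timestamp")
    )


# In a job or notebook of the delta-lake-pipeline repo (wheel installed on the cluster):
# from delta_lake_library.kafka import read_kafka_topic
# orders = read_kafka_topic(spark, "broker:9092", "orders")
```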