Road to Lakehouse - Part 1: Delta Lake data pipeline overview
This is the first part of a series on implementing a Data Lakehouse and its data pipelines with Delta Lake. The source code GitHub repositories are linked at the end of this article. For the code explanation, please take a look at the second part:
Road to Lakehouse - Part 2: Ingest and process data from Kafka with CDC and Delta Lake's CDF
What is a Data Lakehouse, and why?
Lakehouse is a buzzword in the data field nowadays. According to AWS, the lake house architecture is about integrating a data lake, a data warehouse, and purpose-built stores to enable unified governance and easy data movement. According to Databricks, the Lakehouse is an open architecture that combines the best elements of data lakes and data warehouses.
And for me, the Lakehouse is simply a data lake that holds both columnar data files and delta files (transaction logs) in row-based formats.
Delta Lake, Iceberg, and Hudi are three popular data lake table formats that support ACID transactions, schema evolution, upserts, time travel, incremental consumption, etc.
In this post, we will focus on Delta Lake with support from Databricks.
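To make these table-format capabilities a bit more concrete, below is a minimal sketch using the Delta Lake Python API: an upsert with MERGE and a time-travel read. The path, schema, and sample data are placeholders chosen for illustration, not anything from the pipeline described in this series.

```python
# A minimal, illustrative sketch of two Delta Lake capabilities on Spark:
# an upsert via MERGE and a time-travel read.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create a small Delta table to work with (placeholder location and schema).
spark.createDataFrame([("c1", "Alice", "US")], ["id", "name", "country"]) \
    .write.format("delta").mode("overwrite").save("/mnt/delta/customers")

target = DeltaTable.forPath(spark, "/mnt/delta/customers")
updates = spark.createDataFrame(
    [("c1", "Alice", "DE"), ("c2", "Bob", "VN")],
    ["id", "name", "country"],
)

# Upsert: update matching rows and insert new ones in a single ACID transaction.
(
    target.alias("t")
    .merge(updates.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Time travel: read the table as it was at an earlier version.
previous_version = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("/mnt/delta/customers")
)
```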
Data lakehouse features
In comparison with a pure data lake, the Lakehouse provides:
The Lakehouse also has more features compared with a traditional Data Warehouse:
Here are some highlights of Delta Lake:
Data pipeline overview
For demonstration purposes, we will implement the data pipeline below in the Delta Lake.
1. First, we capture changes from the source databases (e.g. MongoDB with Debezium) together with back-end events and publish them to Kafka topics.
2. Once we have the data in Kafka, we use Spark Structured Streaming to consume it and load it into Delta tables in the raw area. Thanks to checkpointing, which stores the Kafka offsets in this case, we can recover from failures with exactly-once fault tolerance. We need to enable the Delta Lake CDF (Change Data Feed) feature on these raw tables in order to serve the further layers; see the first sketch after this list.
3. We use Spark Structured Streaming again to ingest the changes from the raw tables. We then apply transformations such as flattening and exploding nested data, etc., and load the cleansed data into the next area, the refined zone. Remember to add the CDF property to the refined tables as well; see the second sketch after this list.
4. Now we are ready to build data mart tables for the business level by aggregating or joining tables from the refined area. This step still runs in near real time because, once more, we read the changes from the previous layer's tables with Spark Structured Streaming; the third sketch after this list shows one possible approach.
5. All the metadata is stored in the Databricks Data Catalog, and all the above tables can be queried in Databricks SQL Analytics, where we can create SQL endpoints and use a SQL editor. Besides the default catalog, we can use the open-source Amundsen for metadata discovery.
6. Finally, we can build data visualizations with Databricks SQL Analytics Dashboards (formerly Redash) or use BI tools like Power BI, Streamlit, etc.
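As a sketch of step 2, the snippet below shows one way to consume a Kafka topic with Spark Structured Streaming and append it to a raw Delta table with the Change Data Feed enabled. The broker address, topic, table, and checkpoint path are placeholders; this is an illustration, not the exact code from the repositories.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Raw table with CDF enabled so that the refined layer can read its changes.
spark.sql("""
    CREATE TABLE IF NOT EXISTS raw.orders (
        key STRING, value STRING, topic STRING, timestamp TIMESTAMP
    )
    USING DELTA
    TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

kafka_stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder brokers
    .option("subscribe", "orders")                      # placeholder topic
    .option("startingOffsets", "earliest")
    .load()
    .selectExpr("CAST(key AS STRING) AS key",
                "CAST(value AS STRING) AS value",
                "topic", "timestamp")
)

# The checkpoint stores the consumed Kafka offsets, which is what allows the
# stream to restart after a failure without losing or duplicating records.
(
    kafka_stream.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/checkpoints/raw_orders")
    .toTable("raw.orders")
)
```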
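For step 3, here is a hedged sketch of reading the raw table's change feed, flattening and exploding a nested JSON payload, and appending the cleansed rows to a refined table (which should also have CDF enabled in its table properties). The payload schema and all names are assumptions made for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, from_json
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Assumed shape of the Kafka message payload, for illustration only.
payload_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer", StructType([
        StructField("id", StringType()),
        StructField("country", StringType()),
    ])),
    StructField("items", ArrayType(StructType([
        StructField("sku", StringType()),
        StructField("qty", StringType()),
    ]))),
])

# Read only the changes of the raw table through the Change Data Feed.
changes = (
    spark.readStream
    .format("delta")
    .option("readChangeFeed", "true")
    .table("raw.orders")
    .filter(col("_change_type").isin("insert", "update_postimage"))
)

# Flatten the nested struct and explode the items array into one row per item.
refined = (
    changes
    .withColumn("payload", from_json(col("value"), payload_schema))
    .withColumn("item", explode(col("payload.items")))
    .select(
        col("payload.order_id").alias("order_id"),
        col("payload.customer.id").alias("customer_id"),
        col("payload.customer.country").alias("country"),
        col("item.sku").alias("sku"),
        col("item.qty").cast("int").alias("qty"),
    )
)

(
    refined.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/checkpoints/refined_orders")
    .toTable("refined.orders")  # create this table with CDF enabled as well
)
```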
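And for step 4, one possible way to keep a business-level mart up to date is to read the refined table's change feed and upsert the aggregates with foreachBatch and MERGE, as sketched below. The mart table, keys, and metric are assumptions, and a production job would also need to make the merge idempotent against batch replays.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as sum_

spark = SparkSession.builder.getOrCreate()

# Mart table that holds the aggregated business metric (placeholder schema).
spark.sql("""
    CREATE TABLE IF NOT EXISTS mart.orders_by_country (country STRING, total_qty BIGINT)
    USING DELTA
""")

def upsert_to_mart(batch_df, batch_id):
    # Aggregate the micro-batch, then merge the result into the mart by key.
    agg = batch_df.groupBy("country").agg(sum_("qty").alias("total_qty"))
    mart = DeltaTable.forName(spark, "mart.orders_by_country")
    (
        mart.alias("t")
        .merge(agg.alias("s"), "t.country = s.country")
        .whenMatchedUpdate(set={"total_qty": col("t.total_qty") + col("s.total_qty")})
        .whenNotMatchedInsertAll()
        .execute()
    )

(
    spark.readStream
    .format("delta")
    .option("readChangeFeed", "true")
    .table("refined.orders")
    .filter(col("_change_type") == "insert")
    .writeStream
    .foreachBatch(upsert_to_mart)
    .option("checkpointLocation", "/mnt/checkpoints/mart_orders_by_country")
    .start()
)
```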
Conclusion
The Lakehouse provides a comprehensive architecture that takes the benefits of both the Data Lake and the Data Warehouse. With just this straightforward data pipeline, we can serve both analytical and operational purposes. In the next part, we will ingest streaming data from MongoDB with Debezium as well as back-end events and load it into the first zone of the Lakehouse; after that, we will process the changes from the raw area and store the cleansed and flattened data in the refined zone.
Source code
Because we are using Spark Structured Streaming to ingest, process, and load the data, there is a Spark cluster running 24/7. We implement all common libraries related to Kafka ingestion in the delta-lake-library repo; the built Python wheel file is installed on the cluster so these libraries are shared among all Spark jobs as well as notebooks. The second repo is delta-lake-pipeline, which contains both the streaming and batch jobs.
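As a purely hypothetical illustration of this layout (the actual module and function names in the delta-lake-library repo may differ), a shared Kafka reader packaged in the wheel could look roughly like this and then be imported in any pipeline job or notebook on the cluster:

```python
# delta_lake_library/kafka.py (hypothetical module inside the shared wheel)
from pyspark.sql import DataFrame, SparkSession


def read_kafka_topic(spark: SparkSession, brokers: str, topic: str) -> DataFrame:
    """Common Kafka source reader shared by all streaming jobs."""
    return (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", brokers)
        .option("subscribe", topic)
        .load()
        .selectExpr("CAST(value AS STRING) AS value", "timestamp")
    )


# In a job or notebook of the delta-lake-pipeline repo (wheel installed on the cluster):
# from delta_lake_library.kafka import read_kafka_topic
# orders = read_kafka_topic(spark, "broker:9092", "orders")
```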