Databricks Processing using Medallion Data Quality Zones - Part 1
Problem
The Delta Lakehouse design uses a medallion (bronze, silver, and gold) architecture for data quality. How can we abstract the read and write actions in Spark to create a dynamic notebook to process data files?
Solution
The data movement between the bronze and silver zones is a consistent pattern. Therefore, we will build generic read and write functions to handle various file types. Once these functions are tested, we can put the pieces together to create and schedule a dynamic notebook.
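Below is a minimal sketch of what these generic functions could look like, assuming a Databricks environment with mounted storage; the function names, zone paths, and supported file types are illustrative placeholders, not the article's final implementation.

```python
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

def read_from_zone(path: str, file_type: str, options: dict = None) -> DataFrame:
    """Read a file from a quality zone into a DataFrame, keyed by file type."""
    reader = spark.read.options(**(options or {}))
    if file_type == "csv":
        return reader.option("header", "true").csv(path)
    elif file_type == "json":
        return reader.json(path)
    elif file_type == "parquet":
        return reader.parquet(path)
    else:
        raise ValueError(f"Unsupported file type: {file_type}")

def write_to_zone(df: DataFrame, path: str, mode: str = "append") -> None:
    """Write a DataFrame to a quality zone as a Delta table."""
    df.write.format("delta").mode(mode).save(path)

# Example: promote a bronze CSV file into the silver zone.
df = read_from_zone("/mnt/bronze/sales/2024-01-01.csv", "csv")
write_to_zone(df, "/mnt/silver/sales")
```

Keying the read logic on a file-type parameter is what makes the notebook dynamic: the same two functions can be driven by metadata rather than hard-coded per source file.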
Business Problem
The top management at the Adventure Works company is interested in creating a Delta Lakehouse. In the medallion architecture, data quality improves as files are processed from left to right: bronze to silver to gold.

In my design, I will use a stage zone. This storage container holds only today's data file, while the bronze zone keeps a copy of every data file. Retaining all files may be a requirement for highly regulated industries that need a file audit trail.
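The copy from stage to bronze can be a simple file operation. This hypothetical snippet shows one way to archive today's file before processing; the mount points and date-based naming convention are assumptions for illustration.

```python
from datetime import date

stage_path = "/mnt/stage/sales.csv"                    # only today's file lives here
bronze_path = f"/mnt/bronze/sales/{date.today()}.csv"  # bronze keeps every file

# dbutils is available in Databricks notebooks without an import
dbutils.fs.cp(stage_path, bronze_path)
```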
Details
This is the first article of four on how to process data using a metadata-driven design. Please see today's MSSQLTips article for details.