Databricks PySpark Type 2 SCD Function for Azure Dedicated SQL Pools
Slowly Changing Dimensions (SCD) is a dimensional modeling technique commonly used in data warehousing to capture changes to dimension data over time. The three most commonly used SCD types are 0, 1 and 2.
The majority of DW/BI projects have Type 2 dimensions, where a change to an attribute causes the current dimension record to be end-dated and a new record to be created, giving a complete history of the data changes. See the example below.
Data Before
Data After
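For instance, imagine a tariff dimension where the rate changes on 01/01/2020 (the columns and values below are invented purely for illustration):

Before:
TariffId | Rate | SCDStartDate            | SCDEndDate
T1       | 0.25 | 2019-01-01 00:00:00.000 | 9999-12-31 00:00:00.000

After:
TariffId | Rate | SCDStartDate            | SCDEndDate
T1       | 0.25 | 2019-01-01 00:00:00.000 | 2020-01-01 00:00:00.000
T1       | 0.28 | 2020-01-01 00:00:00.000 | 9999-12-31 00:00:00.000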
I used this on my latest project with an electrical distribution company, where the aim was to track approved Australian Energy Regulator tariff changes over time.
Today I’m going to show you how to create a reusable PySpark function that can be shared across Databricks workflows with minimal effort.
Type 2 SCD PySpark Function
Before we start writing code we must understand the Databricks Azure Synapse Analytics connector. It supports read and write operations and accepts valid SQL statements as pre-actions or post-actions, which run before or after writing to the table. To create this function, then, the code must build a valid SQL statement and pass it to the connector.
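As a minimal sketch (the JDBC URL, storage path, table name and SQL statement are placeholders, not the article’s actual values), a write through the connector with a post-action looks like this:

# Minimal sketch of a write through the Databricks Azure Synapse connector.
# All connection details and the SQL statement below are placeholders.
(df.write
    .format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
    .option("tempDir", "abfss://<container>@<account>.dfs.core.windows.net/tmp")
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "dbo.DimTariff")
    .option("postActions", "UPDATE dbo.DimTariff SET <...>")  # runs after the write
    .mode("append")
    .save())

preActions works the same way but runs before the write; both accept a semicolon-separated list of SQL statements.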
Prerequisite
Functionality
Input Format: 01/01/2020 or 2020-01-01
Output Format: 2020-01-01 00:00:00.000
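Under the hood this is a one-line conversion with PySpark’s to_timestamp; a minimal sketch, assuming hypothetical column names:

from pyspark.sql import functions as F

# The format string comes from the scd_TimeStampFormat parameter:
# "dd/MM/yyyy" parses 01/01/2020 and "yyyy-MM-dd" parses 2020-01-01.
# Either way the result is a timestamp at midnight: 2020-01-01 00:00:00.000.
df = df.withColumn("SCDStartDate",
                   F.to_timestamp(F.col("EffectiveDate"), "dd/MM/yyyy"))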
Input Parameters
{
  "scd_effective_date_column": "<Enter Here - Input Date Column>",
  "scd_end_timestamp_column": "<Enter Here - SCD EndDate Column Name>",
  "scd_start_timestamp_column": "<Enter Here - SCD StartDate Column Name>",
  "scd_TimeStampFormat": "<Enter Here - Source Date Format, used to convert to the destination DateType>",
  "lookupColumns": "<Enter lookup column(s), pipe separated>"
}
Code
Please see the comments on each block of code for an explanation.
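The full notebook isn’t reproduced here, so the following is a hedged sketch of how the pieces fit together. The function name, argument names, high-date sentinel and SQL shape are my assumptions, not the original implementation:

from pyspark.sql import functions as F

HIGH_DATE = "9999-12-31 00:00:00.000"  # assumed "current row" sentinel

def scd2_write(df, params, jdbc_url, temp_dir, target_table):
    # `params` follows the input-parameter JSON shown above.
    start_col = params["scd_start_timestamp_column"]
    end_col = params["scd_end_timestamp_column"]
    fmt = params["scd_TimeStampFormat"]
    lookup_cols = [c.strip() for c in params["lookupColumns"].split("|")]

    # Normalise the effective date and stamp the SCD columns on the new rows.
    df = (df
          .withColumn(start_col,
                      F.to_timestamp(F.col(params["scd_effective_date_column"]), fmt))
          .withColumn(end_col, F.lit(HIGH_DATE).cast("timestamp")))

    # Build one UPDATE per incoming business key that end-dates the current
    # row in the target. Collecting keys to the driver keeps the sketch
    # simple but only suits modest batch sizes.
    updates = []
    for r in df.select(*lookup_cols, start_col).distinct().collect():
        match = " AND ".join(f"{c} = '{r[c]}'" for c in lookup_cols)
        updates.append(f"UPDATE {target_table} SET {end_col} = '{r[start_col]}' "
                       f"WHERE {match} AND {end_col} = '{HIGH_DATE}'")

    # Append the new current rows; the connector runs the pre-actions first.
    writer = (df.write
              .format("com.databricks.spark.sqldw")
              .option("url", jdbc_url)
              .option("tempDir", temp_dir)
              .option("forwardSparkAzureStorageCredentials", "true")
              .option("dbTable", target_table)
              .mode("append"))
    if updates:
        writer = writer.option("preActions", ";".join(updates))
    writer.save()

In a production version you would also filter out unchanged records before the write, since in this sketch every incoming key end-dates its current row.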
Conclusion
If you would like a copy, please drop me a message and I can send you a link to my private Git repo.
I hope you have found this helpful and that it saves you time writing your own PySpark Type 2 SCD function. Any thoughts, questions, corrections and suggestions are very welcome :)
Please share on LinkedIn if you found this useful #DataMastery #DataEngineering #Share #Community #Databricks #PySpark #Type2 #SCD