Detecting Schema and Data Drift within Automation: How Can Azure Help?
Satya Shyam K Jayanty
Data Advisory-Leadership, Enterprise Data Architect, Data Governance Advocate, Data~Cloud Strategy & Microsoft MVP (2006-2020), experienced Enterprise Data Architect ~ Author, Data Community Influencer and DAMA Mentor
There is no doubt that cloud computing has opened new horizons for the data platform world!
Using a cloud-based relational database service from a market-leading provider is good enough to deliver predictable performance, scalability and business continuity. Based on my recent consulting engagements, I can say without doubt that Microsoft Azure delivers such continuity with less administration. IT Pros do need to ramp up their skills, as the scenario is different from what they were used to in the on-premises/data center arena.
ETL and ELT play a major role in automating data ingestion patterns and delivering data movement. Azure Data Factory (ADF) has come a long way in enhancing the methods used to orchestrate and manage data pipelines.
One of the solutions I designed had a requirement to detect changes to the metadata of its data sources, which is called Schema Drift. This can be handled with custom code, but only to a limited extent; without any protection, the data flow is exposed to those changes, which cause ETL/ELT patterns to fail. When incoming (source) columns and fields change frequently, the Schema Drift feature is useful to data engineering as protection against such failures.
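To illustrate what detecting schema drift "with code" can look like, here is a minimal sketch in plain Python. It is not how ADF works internally; the function names `detect_schema_drift` and `has_drift`, and the representation of a schema as a column-name-to-type-name dictionary, are assumptions made for this example.

```python
from typing import Dict, List


def detect_schema_drift(
    expected: Dict[str, str], incoming: Dict[str, str]
) -> Dict[str, List[str]]:
    """Compare an expected schema with an incoming one.

    Both schemas map column name -> type name (a hypothetical representation).
    """
    # Columns present in the source but unknown to the pipeline.
    added = sorted(set(incoming) - set(expected))
    # Columns the pipeline expects but the source no longer delivers.
    removed = sorted(set(expected) - set(incoming))
    # Columns present on both sides whose type name changed.
    retyped = sorted(
        name
        for name in set(expected) & set(incoming)
        if expected[name] != incoming[name]
    )
    return {"added": added, "removed": removed, "retyped": retyped}


def has_drift(report: Dict[str, List[str]]) -> bool:
    """True if any category of the drift report is non-empty."""
    return any(report.values())


expected_schema = {"id": "int", "name": "string", "amount": "decimal"}
incoming_schema = {"id": "int", "name": "string", "amount": "string", "region": "string"}
report = detect_schema_drift(expected_schema, incoming_schema)
```

A check like this only reports drift; deciding what to do with new, missing or retyped columns is still left to the pipeline, which is where a built-in feature becomes more practical.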
ADF has native support for flexible schemas that change from execution to execution, which helps to build generic data transformation logic without the need to recompile the entire data flow. To enable this feature, select the 'Allow schema drift' option in the Source transformation within ADF.
There are a few things to consider when this option is selected. By default, all incoming fields are read from the source during data flow execution and passed through the flow to the sink. Any newly detected columns (which we can call drifted columns) are assigned the String data type by default. To avoid this, choose 'Infer drifted column types' so that ADF automatically infers their data types. On the Sink side, make sure to choose 'Auto-Map', which maps the new fields in the Sink transformation so that they land in the destination.
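To make the idea of inferring a type for a drifted column concrete, the sketch below guesses a type name from string values. This is only an illustration of the concept, not ADF's actual inference logic; the function name `infer_column_type`, the returned type names and the single date format are assumptions.

```python
from datetime import datetime
from typing import Callable, Iterable


def infer_column_type(values: Iterable[str]) -> str:
    """Guess a type name for a column whose values arrived as strings."""
    # Ignore missing and blank values when sampling.
    samples = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not samples:
        return "string"  # nothing to infer from, keep the default

    def all_parse(parser: Callable[[str], object]) -> bool:
        """True if every sample is accepted by the given parser."""
        for value in samples:
            try:
                parser(value)
            except ValueError:
                return False
        return True

    # Try the narrowest type first and fall back to string.
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "double"
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"  # assumes ISO-style dates only
    return "string"
```

For example, `infer_column_type(["1.5", "2"])` falls through the integer check and is classified as a double, while a column mixing text and numbers stays a string.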
Similar to Schema Drift, there is another requirement in Data Science projects involving Machine Learning models: detecting changes between the data a model was trained on and the data the deployed model receives. Such changes can cause major issues for predictive Machine Learning models. This is called Data Drift.
The Azure Machine Learning service can help to monitor the inputs to models deployed on Azure Kubernetes Service (AKS). This capability is in preview (as of this writing) and offers limited configuration. The Microsoft documentation states:
With Azure Machine Learning service, you can monitor the inputs to a model deployed on AKS and compare this data to the training dataset for the model. At regular intervals, the inference data is snapshot and profiled, then computed against the baseline dataset to produce a data drift analysis that:
Measures the magnitude of data drift, called the drift coefficient.
Measures the data drift contribution by feature, informing which features caused data drift.
Measures distance metrics. Currently Wasserstein and Energy Distance are computed.
Measures distributions of features. Currently kernel density estimation and histograms.
Send alerts to data drift by email.
Using Azure Machine Learning service, data drift is monitored through datasets or deployments. To monitor for data drift, a baseline dataset - usually the training dataset for a model - is specified. A second dataset - usually model input data gathered from a deployment - is tested against the baseline dataset. Both datasets are profiled and input to the data drift monitoring service.
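As a rough illustration of comparing a baseline dataset with model input data, the sketch below computes the one-dimensional Wasserstein distance per feature in plain Python. It is not the Azure Machine Learning implementation and does not reproduce its drift coefficient; the function names and the dictionary-of-lists data layout are assumptions for this example.

```python
from bisect import bisect_right
from typing import Dict, Sequence


def wasserstein_1d(baseline: Sequence[float], target: Sequence[float]) -> float:
    """Wasserstein (earth mover's) distance between two 1-D samples.

    Computed as the area between the two empirical CDFs.
    """
    if not baseline or not target:
        raise ValueError("both samples must be non-empty")
    u = sorted(baseline)
    v = sorted(target)
    points = sorted(u + v)
    distance = 0.0
    for left, right in zip(points, points[1:]):
        # Empirical CDF values on the interval [left, right).
        cdf_u = bisect_right(u, left) / len(u)
        cdf_v = bisect_right(v, left) / len(v)
        distance += abs(cdf_u - cdf_v) * (right - left)
    return distance


def drift_by_feature(
    baseline: Dict[str, Sequence[float]], target: Dict[str, Sequence[float]]
) -> Dict[str, float]:
    """Distance per feature, for features present in both datasets."""
    return {
        name: wasserstein_1d(baseline[name], target[name])
        for name in baseline
        if name in target
    }


training = {"age": [0.0, 1.0, 3.0], "score": [1.0, 2.0, 3.0]}
serving = {"age": [5.0, 6.0, 8.0], "score": [1.0, 2.0, 3.0]}
distances = drift_by_feature(training, serving)
```

Here the `age` feature has shifted while `score` is unchanged, so only `age` shows a non-zero distance. The distance is expressed in the feature's own units, so any alerting threshold would have to be chosen per feature.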
More information: Detect data drift (preview) on models deployed to Azure Kubernetes Service (AKS)