
Detecting schema and data Drift within automation - How Azure can help?

Satya Shyam K Jayanty

Data Advisory-Leadership, Enterprise Data Architect, Data Governance Advocate, Data~Cloud Strategy & Microsoft MVP (2006-2020), experienced Enterprise Data Architect ~ Author, Data Community Influencer and DAMA Mentor

Published: September 11, 2019

There is no doubt that cloud computing has opened a new horizon for the data platform world!

Using a relational database service in the cloud with a market-leading provider is enough to deliver predictable performance, scalability and business continuity. Based on my recent consulting engagements, I can say without doubt that Microsoft Azure delivers such continuity with less administration. IT Pros will certainly need to ramp up their skills, as the scenario differs from what they were used to in an on-premises data center.

ETL and ELT play a major role in automating data ingestion patterns and delivering data movement. For orchestrating and managing data pipelines, Azure Data Factory (ADF) has come a long way in enhancing these methods.

One of the solutions I designed had a requirement to find changes to metadata coming from the data sources, which is called Schema Drift. This can be accomplished with custom code, but only to a limited extent; without such protection the data flow is exposed to those changes, which will cause ETL/ELT patterns to fail. When incoming (source) columns and fields change frequently, the Schema Drift feature is useful for data engineering as protection against such failures.
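To illustrate the hand-coded approach mentioned above, here is a minimal sketch in Python. It assumes a schema can be represented as a plain mapping of column name to type name; the `SchemaDrift` class, the `detect_schema_drift` function and the sample columns are hypothetical and are not part of ADF.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaDrift:
    """Differences between an expected schema and an incoming one."""
    added: dict = field(default_factory=dict)    # column -> incoming type
    removed: dict = field(default_factory=dict)  # column -> expected type
    retyped: dict = field(default_factory=dict)  # column -> (expected, incoming)

    @property
    def has_drift(self) -> bool:
        return bool(self.added or self.removed or self.retyped)


def detect_schema_drift(expected: dict, incoming: dict) -> SchemaDrift:
    """Compare two {column_name: type_name} mappings (hypothetical helper)."""
    drift = SchemaDrift()
    for name, dtype in incoming.items():
        if name not in expected:
            drift.added[name] = dtype
        elif expected[name] != dtype:
            drift.retyped[name] = (expected[name], dtype)
    for name, dtype in expected.items():
        if name not in incoming:
            drift.removed[name] = dtype
    return drift


# Illustrative schemas only; the column names are made up.
expected = {"id": "integer", "name": "string", "amount": "double"}
incoming = {"id": "integer", "name": "string", "amount": "string", "region": "string"}

drift = detect_schema_drift(expected, incoming)
if drift.has_drift:
    print("added:", drift.added)
    print("removed:", drift.removed)
    print("retyped:", drift.retyped)
```

A check like this only reports drift; deciding how the pipeline should react to each kind of change is the part that becomes hard to maintain by hand.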

ADF has native support for flexible schemas that change from execution to execution, which helps to build generic data transformation logic without the need to recompile the entire data flow. To enable this feature, select the 'Allow schema drift' option in the Source transformation settings.

There are a few scenarios to consider when this option is selected. By default, all incoming fields are read from the source during data flow execution and passed through the flow to the sink. Any newly detected columns (which we can call drifted columns) are given the String data type by default. To avoid this, you can choose 'Infer drifted column types', which lets ADF infer the data types automatically. On the Sink side, make sure to choose 'Auto-Map' so that new fields are mapped in the Sink transformation and land in the destination.
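The idea of typing drifted columns instead of leaving them as strings can be sketched in a few lines of Python. This is only an illustration under simple assumptions (rows arrive as dictionaries of strings, and only boolean, integer and double are recognised); it is not how ADF implements 'Infer drifted column types', and the function names are hypothetical.

```python
def _parses(caster, text: str) -> bool:
    """Return True if caster(text) succeeds."""
    try:
        caster(text)
        return True
    except ValueError:
        return False


def infer_column_type(values) -> str:
    """Pick the narrowest type that fits every non-empty value."""
    cleaned = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not cleaned:
        return "string"  # nothing to infer from, keep the default
    if all(v.lower() in ("true", "false") for v in cleaned):
        return "boolean"
    if all(_parses(int, v) for v in cleaned):
        return "integer"
    if all(_parses(float, v) for v in cleaned):
        return "double"
    return "string"


def infer_drifted_types(rows, known_columns) -> dict:
    """Infer types only for columns that are not in the known schema."""
    drifted = {}
    for row in rows:
        for name, value in row.items():
            if name not in known_columns:
                drifted.setdefault(name, []).append(value)
    return {name: infer_column_type(vals) for name, vals in drifted.items()}


# Illustrative rows only; "discount" and "is_vip" play the drifted columns.
rows = [
    {"id": "1", "discount": "0.5", "is_vip": "true"},
    {"id": "2", "discount": "2", "is_vip": "False"},
]
print(infer_drifted_types(rows, known_columns={"id"}))
```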

Similar to Schema Drift, there is another requirement within Data Science projects related to Machine Learning models: detecting changes between the data a model was trained on and the data the deployed model receives. Left undetected, such changes will cause major issues in Machine Learning predictive models. This is called Data Drift.

Azure Machine Learning service can help to monitor the inputs to models deployed on Azure Kubernetes Service (AKS). This capability is in preview (as of now) and has limited configuration. The Microsoft documentation states:

With Azure Machine Learning service, you can monitor the inputs to a model deployed on AKS and compare this data to the training dataset for the model. At regular intervals, the inference data is snapshot and profiled, then computed against the baseline dataset to produce a data drift analysis that:

Measures the magnitude of data drift, called the drift coefficient.

Measures the data drift contribution by feature, informing which features caused data drift.

Measures distance metrics. Currently Wasserstein and Energy Distance are computed.

Measures distributions of features. Currently kernel density estimation and histograms.

Sends alerts about data drift by email.

Using Azure Machine Learning service, data drift is monitored through datasets or deployments. To monitor for data drift, a baseline dataset - usually the training dataset for a model - is specified. A second dataset - usually model input data gathered from a deployment - is tested against the baseline dataset. Both datasets are profiled and input to the data drift monitoring service.
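To make the baseline-versus-inference comparison concrete, the following Python sketch computes a one-dimensional Wasserstein distance per feature using only the standard library. It is an illustration of the distance metric named above, not the Azure Machine Learning implementation, and it does not reproduce the service's drift coefficient. The function names and sample values are hypothetical.

```python
from bisect import bisect_right


def wasserstein_1d(baseline, current) -> float:
    """Area between the two empirical CDFs (1-D Wasserstein-1 distance)."""
    a, b = sorted(baseline), sorted(current)
    points = sorted(a + b)
    distance = 0.0
    for left, right in zip(points, points[1:]):
        # Fraction of each sample that is <= left.
        cdf_a = bisect_right(a, left) / len(a)
        cdf_b = bisect_right(b, left) / len(b)
        distance += abs(cdf_a - cdf_b) * (right - left)
    return distance


def drift_by_feature(baseline: dict, current: dict) -> dict:
    """Distance for every feature present in both datasets."""
    shared = baseline.keys() & current.keys()
    return {name: wasserstein_1d(baseline[name], current[name]) for name in shared}


# Made-up numbers: "age" is unchanged, "income" has shifted upwards.
training = {"age": [25, 32, 41, 58], "income": [30.0, 42.0, 55.0, 61.0]}
serving = {"age": [25, 32, 41, 58], "income": [40.0, 52.0, 65.0, 71.0]}

for feature, dist in sorted(drift_by_feature(training, serving).items()):
    print(f"{feature}: {dist:.2f}")
```

The distances are expressed in each feature's own units, so in practice they would need to be normalised before comparing features against one another or against a single alert threshold.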

More information: Detect data drift (preview) on models deployed to Azure Kubernetes Service (AKS)

Rafi Benjaro

Helping enterprises derive value from all kinds of data (Data & AI Solution Specialist) @Microsoft

5 years ago

Great article Satya!

