Data Quality Monitoring
Harihar Mohapatra
Deloitte | Driving Digital Transformations with Engineering, AI & Data | Technology Leader
Data governance and data quality are top of mind in more and more organizations. In today's digital age, data has become a critical asset that guides the direction of businesses of all sizes.
Data quality describes the accuracy, completeness, consistency, and other attributes of data. Organizations need high-quality data that they can trust to make critical decisions. Without high-quality data, organizations cannot become data-driven because they cannot trust their data. That lack of trust prevents them from using their data to make impactful business decisions, leading to inefficiency, missed opportunities, and ultimately financial loss. Working with product or customer data from disparate sources without considering data quality can clearly lead to disastrous results.
Gartner attributes much of the data quality problem to the traditional rules-based, manual approach to ensuring data quality.
Key metrics of Data Quality Monitoring
Error ratio
The error ratio measures the proportion of records with errors in a dataset. A high error ratio indicates poor data quality and could lead to incorrect insights or faulty decision-making. Divide the number of records with errors by the total number of entries to calculate the error ratio.
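As a minimal sketch, assuming a hypothetical ORDERS table with a boolean IS_VALID flag (or any predicate that identifies an erroneous record), the error ratio can be computed directly in SQL:

-- Error ratio: records failing a validity check divided by all records
-- ORDERS and IS_VALID are placeholders; substitute your own table and error predicate
SELECT
  SUM(CASE WHEN NOT is_valid THEN 1 ELSE 0 END) * 1.0 / COUNT(*) AS error_ratio
FROM orders;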
Duplicate record rate
Duplicate records can occur when multiple entries are created for a single entity due to system glitches or human error. The duplicate record rate calculates the percentage of duplicate entries within a given dataset compared to all records.
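A minimal sketch, assuming a hypothetical CUSTOMERS table in which EMAIL is expected to be unique; the duplicate record rate is the share of rows beyond the first occurrence of each key:

-- Duplicate record rate: extra rows per key divided by all rows
SELECT
  (COUNT(*) - COUNT(DISTINCT email)) * 1.0 / COUNT(*) AS duplicate_record_rate
FROM customers;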
Data time-to-value
Data time-to-value measures how quickly your organization obtains value from data after it has been collected. A shorter time-to-value indicates that your organization is efficient at processing and analyzing data for decision-making purposes.
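One hedged way to quantify this, assuming your pipeline records both an INGESTED_AT and a FIRST_USED_AT timestamp (for example, when a record first lands in a reporting table), is to measure the average lag between the two (Snowflake-style DATEDIFF shown):

-- Average hours from ingestion to first downstream use; table and column names are assumptions
SELECT
  AVG(DATEDIFF('hour', ingested_at, first_used_at)) AS avg_time_to_value_hours
FROM payment_events
WHERE first_used_at IS NOT NULL;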
Data quality monitoring techniques
Data profiling
Data profiling is the process of examining, analyzing and understanding the content, structure and relationships within your data.
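As an illustration, a basic profile of a single column can be gathered with plain SQL; the PAYMENTS table and STATUS column here are only placeholders:

-- Simple column profile: volume, completeness, cardinality, and value range
SELECT
  COUNT(*)                 AS row_count,
  COUNT(status)            AS non_null_count,
  COUNT(*) - COUNT(status) AS null_count,
  COUNT(DISTINCT status)   AS distinct_values,
  MIN(status)              AS min_value,
  MAX(status)              AS max_value
FROM payments;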
Data auditing
Data auditing is the process of assessing the accuracy and completeness of data by comparing it against predefined rules or standards. This technique helps organizations identify and track data quality issues, such as missing, incorrect, or inconsistent data.
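A simple audit, sketched here against a hypothetical PAYMENTS table, counts how many rows violate each predefined rule:

-- Count rule violations; the rules, columns, and allowed values are assumptions for illustration
SELECT
  SUM(CASE WHEN amount <= 0 THEN 1 ELSE 0 END) AS nonpositive_amounts,
  SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS missing_customer_ids,
  SUM(CASE WHEN status NOT IN ('COMPLETED', 'FAILED', 'PENDING') THEN 1 ELSE 0 END) AS unexpected_statuses
FROM payments;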
Data quality rules
Data quality rules are predefined criteria that your data must meet to ensure its accuracy, completeness, consistency and reliability.
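Where the database engine enforces constraints, some of these rules can be declared with the data itself. The sketch below is generic ANSI-style SQL with hypothetical names; Snowflake, for example, does not enforce CHECK constraints, so there the same rules would typically live in queries or DMFs instead:

-- Hypothetical staging table with declarative quality rules
CREATE TABLE staging_payments (
  payment_id  VARCHAR NOT NULL,            -- completeness: key must be present
  amount      NUMERIC CHECK (amount > 0),  -- validity: amounts must be positive
  currency    CHAR(3) NOT NULL,            -- standard three-character currency code expected
  status      VARCHAR NOT NULL,
  created_at  TIMESTAMP NOT NULL
);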
Data cleansing
Data cleansing, also known as data scrubbing or data cleaning, is the process of identifying and correcting errors, inconsistencies and inaccuracies in your data.
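A hedged cleansing sketch over a hypothetical CUSTOMERS_STAGE table: trim whitespace, standardize casing, and keep only the most recent row per key (Snowflake-style QUALIFY shown; a subquery over ROW_NUMBER() works in other engines):

-- Build a cleaned copy of the staging data
CREATE OR REPLACE TABLE customers_clean AS
SELECT
  customer_id,
  TRIM(LOWER(email))       AS email,      -- normalize casing and whitespace
  INITCAP(TRIM(full_name)) AS full_name,  -- standardize name formatting
  updated_at
FROM customers_stage
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) = 1;  -- drop older duplicates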
Real-time data monitoring
Real-time data monitoring is the process of continuously tracking and analyzing data as it is generated, processed and stored within your organization.
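A lightweight sketch of this idea, assuming new rows carry a LOADED_AT timestamp: run a small check on a short recurring schedule (a Snowflake task, Airflow, or any scheduler) and alert when the failure rate in the latest window crosses a threshold:

-- Failure count in the last five minutes; the table, columns, and window are assumptions
SELECT
  COUNT(*) AS rows_last_5_min,
  SUM(CASE WHEN status = 'FAILED' THEN 1 ELSE 0 END) AS failed_rows
FROM payments
WHERE loaded_at >= DATEADD('minute', -5, CURRENT_TIMESTAMP());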
Tracking data quality metrics
Data quality metrics are quantitative measures that help organizations assess the quality of their data. These metrics can be used to track and monitor data quality over time, identify trends and patterns and determine the effectiveness of your data quality monitoring techniques.
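One simple pattern, assuming a hypothetical DQ_METRICS_HISTORY table, is to append a snapshot of each metric on a schedule so trends can be charted over time:

-- Record today's error ratio for later trend analysis; names and the rule are illustrative
INSERT INTO dq_metrics_history (measured_at, table_name, metric_name, metric_value)
SELECT
  CURRENT_TIMESTAMP(),
  'PAYMENTS',
  'error_ratio',
  SUM(CASE WHEN status NOT IN ('COMPLETED', 'FAILED', 'PENDING') THEN 1 ELSE 0 END) * 1.0 / COUNT(*)
FROM payments;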
Data performance testing
Data performance testing is the process of evaluating the efficiency, effectiveness and scalability of your data processing systems and infrastructure.
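In Snowflake, one hedged starting point is the built-in query history, which makes it easy to find the slowest statements touching your pipelines (the 7-day window and 60-second threshold below are arbitrary choices):

-- Slowest recent queries; TOTAL_ELAPSED_TIME is reported in milliseconds
SELECT query_text, warehouse_name, total_elapsed_time / 1000 AS elapsed_seconds
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
  AND total_elapsed_time > 60000
ORDER BY total_elapsed_time DESC
LIMIT 20;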
As technology continues to evolve, several trends are shaping the future of data quality management:
Artificial Intelligence and Machine Learning: AI and ML algorithms are increasingly being used to automate data quality processes, detect anomalies, and improve data cleansing techniques.
Blockchain Technology: Blockchain offers enhanced data integrity and security, reducing the risk of data tampering and ensuring trust in digital transactions.
Regulatory Landscape: Evolving regulatory requirements, such as GDPR and CCPA, are placing greater emphasis on data governance and compliance, driving organizations to prioritize data quality management.
Predictive Analytics: Predictive analytics lets organizations anticipate and prevent data quality issues before they occur, supporting proactive management of data quality.
Snowflake Data Metric Functions
Data Metric Functions (DMFs) are a class of Snowflake functions that can be used to monitor the quality of your data. Both “out of the box” system functions provided by Snowflake and user-defined functions are available. Once enabled, these functions provide regular metrics on data quality issues within the tables you specify.
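For a quick spot check, the system DMFs in the SNOWFLAKE.CORE schema can also be called ad hoc before anything is scheduled; the table and column below are placeholders:

-- Count NULLs and duplicates in a column on demand
SELECT SNOWFLAKE.CORE.NULL_COUNT(SELECT email FROM my_db.my_schema.customers) AS null_emails;
SELECT SNOWFLAKE.CORE.DUPLICATE_COUNT(SELECT email FROM my_db.my_schema.customers) AS duplicate_emails;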
Consider a table called RAW.PAYPAL.PAYMENTS whose rows represent raw payments made through a payment processor. Sometimes payments fail. It would be useful to know how often this happens and to give users regular insight into the number of failures over a given period of time. DMFs make this extremely easy.
Steps
ALTER TABLE RAW.PAYPAL.PAYMENTS
  ADD DATA METRIC FUNCTION fail_status ON (status);
Everything is logged to a table in SNOWFLAKE.LOCAL called DATA_QUALITY_MONITORING_RESULTS_RAW, which is in turn accessed through a view called DATA_QUALITY_MONITORING_RESULTS.
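End to end, the workflow looks roughly like the sketch below. The fail_status DMF is user-defined, so it must be created before it can be attached; its body, the schedule, and the final query are assumptions for illustration (the exact columns of the results view may vary, so check the view definition in your account):

-- 1. Define a custom DMF that counts failed payments (assumed definition)
CREATE DATA METRIC FUNCTION IF NOT EXISTS fail_status(t TABLE(status VARCHAR))
RETURNS NUMBER
AS 'SELECT COUNT(*) FROM t WHERE status = ''FAILED''';

-- 2. Give the table a schedule so attached DMFs run periodically
ALTER TABLE RAW.PAYPAL.PAYMENTS SET DATA_METRIC_SCHEDULE = '60 MINUTE';

-- 3. Attach the DMF to the STATUS column (the step shown above)
ALTER TABLE RAW.PAYPAL.PAYMENTS
  ADD DATA METRIC FUNCTION fail_status ON (status);

-- 4. Review the measurements collected on each run
SELECT measurement_time, table_name, metric_name, value
FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
WHERE table_name = 'PAYMENTS'
ORDER BY measurement_time DESC;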
Automation of data quality
AWS Glue Data Quality automatically computes statistics, recommends quality rules, monitors your data, and alerts you when it detects issues. For hidden and hard-to-find issues, Glue Data Quality uses ML algorithms. The combined power of the rule-based and ML approaches, along with a serverless, scalable, and open solution, lets you deliver high-quality data for confident business decisions.
Telmai Data Observability Platform helps organizations monitor and manage the quality of their data by providing a centralized view of data across all data sources. Telmai’s engine performs data profiling and analysis to identify potential issues, such as missing values, duplicate records, and incorrect data types; ML-based anomaly detection to surface unexpected values in data that may indicate problems and to predict what can be reasonably expected; and continuous monitoring to detect changes in data quality over time.
Google Cloud Dataplex performs data management and governance using machine learning to classify data, organize data in domains, establish data quality, determine data lineage, and both manage and govern the data lifecycle.
Final Words
Mastering data quality management is essential for organizations seeking to unlock the full potential of their data assets. By understanding the dimensions of data quality, addressing common challenges, adopting best practices, and embracing emerging trends, businesses can ensure data integrity, reliability, and relevance in an increasingly data-driven world.
Two practical places to start:
Critical data elements: Identify what is critical for the business; this could be a regulatory report, a cube, or a KPI.
Data value: Estimate the cost of poor data quality, in other words the risk associated with bad data, and focus first on those areas with the highest risk.