AWS Glue Data Quality - From Raw to Refined
Introduction
What is Data Quality?
Almost every enterprise today is data driven, providing a platform for businesses to achieve more customer-centric business alignment derived from data analytics, machine learning, and generative AI. With the use of data so pervasive, the obvious question enterprises ask themselves is ‘How good is my data, and how is its quality ensured so that it can be trusted?’
Data quality refers to the state of data or datasets measured against metrics such as accuracy, completeness, consistency, and overall reliability, which help enterprises make sound business decisions. Per the GIGO (Garbage In, Garbage Out) maxim, if you use bad data, you get bad results. High-quality data is essential for making informed decisions, managing business efficiency, and ensuring the success of data-driven initiatives.
From my own experience, having worked on several big data engineering and analytics projects, I can vouch for how data quality issues can affect overall business efficiency.
Why is Data Quality Important?
Quality data leads to more accurate insights, better-informed decisions, and better data-driven outcomes.
Clean, organized data reduces time spent on data cleansing and reconciliation. This is especially true for enterprises that generate high volumes of data from many different sources.
Preventing errors caused by poor data quality can save significant resources and avoid bad business decisions.
Accurate customer data improves service levels and communication.
Many industries require maintaining data to certain quality standards, not just for internal usage but also as demanded by external auditors and regulatory agencies.
Key Factors in Assessing Data Quality
Accuracy: Does the data correctly represent the values it’s supposed to?
Completeness: Are all necessary data points present, with no missing values or values that do not make sense?
Consistency: Is the data consistent across different datasets or systems?
Timeliness: Is the data up to date and representative of the current state? This applies to all data sources used in deriving transformed datasets through ETL and in warehousing the data.
Validity: Does the data conform to the defined format, type, and range?
Uniqueness: Are there duplicate records that could skew analysis?
Common Data Quality Issues
Let's look at some key issues that come up in data quality assessments. Each of the following can be treated as a metric for measuring data quality. As an example, if I have a dataset with 1,000 rows and 10 columns, you will typically assess the following and produce a data quality report on how many rows and columns are reliable. This is very important to understand: a row may appear complete, but bad data in one of its fields affects the integrity of the whole row, so data quality analysis evaluates differently when applied across rows than at the column level. We will explore this aspect a little more in the next sections.
Duplicate Rows: Multiple identical or nearly identical records.
Missing Values: Absence of data in required fields, or NULL values.
Inconsistent Formats: Varying date formats or units of measurement within the same column, or alphanumeric strings in double/floating-point fields.
Outliers: Data points that differ significantly from the rest, causing wide statistical variance. An outlier is typically an observation that lies at an abnormal distance from other values in the same column of a dataset.
Data Decay: Outdated information that no longer reflects reality. This can make analytical outcomes unreliable for data-driven decisions.
Invalid Data: Values that don’t meet predefined criteria or business rules.
By addressing these factors and common issues, organizations can significantly improve their data quality, leading to more reliable analytics, improved operational efficiency, and better decision-making capabilities.
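To make the row-versus-column distinction concrete, here is a minimal pandas sketch using a small, made-up DataFrame: column-level completeness can look healthy while the share of fully reliable rows is much lower.
import numpy as np
import pandas as pd

# Hypothetical toy dataset: 5 rows, 3 columns, with a few bad values
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'email': ['a@x.com', None, 'c@x.com', 'd@x.com', 'e@x.com'],
    'age': [34, 29, np.nan, 41, 250],  # 250 is an invalid/outlier value
})

# Column-level view: share of non-null values per column
column_completeness = df.notna().mean()

# Row-level view: a row is reliable only if every field is present and valid
valid_age = df['age'].between(18, 120)
row_ok = df.notna().all(axis=1) & valid_age
row_level_score = row_ok.mean()

print(column_completeness)  # each column looks 80-100% complete
print(row_level_score)      # yet only 2 of 5 rows (40%) are fully reliable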
AWS Glue Data Quality
What is AWS Glue Data Quality?
AWS Glue Data Quality, a feature of the AWS Glue managed service, provides a comprehensive serverless framework to define, apply, and monitor data quality rules, integrating seamlessly with your data workflows. Here's how it operates:
Rule Definition
At the heart of AWS Glue Data Quality is the Data Quality Definition Language (DQDL), a domain-specific language designed to express data quality rules in a clear and concise manner.
DQDL Overview:
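To give a flavour of the syntax, a minimal DQDL ruleset (a trimmed-down version of the full ruleset shown later in this post) looks like this:
Rules = [
    RowCount between 505 and 2020,
    IsComplete "id",
    Uniqueness "email" > 0.95,
    ColumnValues "rating" <= 5
]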
Rule Application
Once you’ve defined your data quality rules using DQDL, you can apply them directly to your datasets within AWS Glue.
Integration with Data Catalog:
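As a hedged sketch of attaching a DQDL ruleset to a cataloged table programmatically, boto3's Glue client exposes create_data_quality_ruleset; the database, table, and ruleset names below are the ones used later in this post and should be replaced with your own.
import boto3

glue = boto3.client('glue')

# A small DQDL ruleset, expressed as a plain string
ruleset = """
Rules = [
    RowCount between 505 and 2020,
    IsComplete "id",
    Uniqueness "email" > 0.95
]
"""

# Attach the ruleset to a table already registered in the Glue Data Catalog
response = glue.create_data_quality_ruleset(
    Name='dqblogruleset',               # ruleset name used when starting evaluation runs
    Ruleset=ruleset,
    TargetTable={
        'DatabaseName': 'dqblogdb',     # catalog database and table from the example in this post
        'TableName': 'dqblogbucket',
    },
)
print(response['Name'])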
Execution of Rules
Metrics Generation
After the rules are applied, AWS Glue Data Quality generates detailed metrics that provide insights into the health of your data.
Detailed Reports:
Data Quality Dimensions:
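A minimal sketch of running an existing ruleset against a cataloged table and pulling back the generated metrics with boto3 might look like the following; the role ARN is a placeholder, and the database, table, and ruleset names match the example used later in this post.
import time
import boto3

glue = boto3.client('glue')

# Kick off an evaluation of an existing ruleset against a Data Catalog table
run = glue.start_data_quality_ruleset_evaluation_run(
    DataSource={'GlueTable': {'DatabaseName': 'dqblogdb', 'TableName': 'dqblogbucket'}},
    Role='arn:aws:iam::123456789012:role/GlueDataQualityRole',  # placeholder role ARN
    RulesetNames=['dqblogruleset'],
)

# Poll until the run finishes, then fetch the metrics it produced
while True:
    status = glue.get_data_quality_ruleset_evaluation_run(RunId=run['RunId'])
    if status['Status'] in ('SUCCEEDED', 'FAILED', 'STOPPED', 'TIMEOUT'):
        break
    time.sleep(15)

for result_id in status.get('ResultIds', []):
    result = glue.get_data_quality_result(ResultId=result_id)
    print('Overall score:', result['Score'])
    for rule in result['RuleResults']:
        print(rule['Name'], rule['Result'], rule.get('EvaluatedMetrics'))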
Integration with AWS Glue Workflows
Data quality checks are not standalone processes; they can be seamlessly integrated into existing AWS Glue workflows.
ETL Jobs:
Automation:
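Within a Glue ETL job script, the same checks can run in-line using the EvaluateDataQuality transform that Glue Studio generates; the snippet below is a sketch of that pattern, and the catalog names, evaluation context, and publishing options are assumptions based on that generated-script style rather than a definitive recipe.
from awsglue.context import GlueContext
from awsgluedq.transforms import EvaluateDataQuality
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the dataset the job is processing (catalog names are placeholders)
source = glue_context.create_dynamic_frame.from_catalog(
    database='dqblogdb', table_name='dqblogbucket'
)

# DQDL rules applied as a step inside the job
ruleset = """
Rules = [
    IsComplete "id",
    ColumnValues "rating" <= 5
]
"""

# Evaluate the rules; results can be published to CloudWatch and the DQ console
dq_results = EvaluateDataQuality.apply(
    frame=source,
    ruleset=ruleset,
    publishing_options={
        'dataQualityEvaluationContext': 'etl_dq_check',
        'enableDataQualityCloudWatchMetrics': True,
        'enableDataQualityResultsPublishing': True,
    },
)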
Monitoring and Alerting
Maintaining high data quality is an ongoing process, and AWS Glue Data Quality provides tools to monitor and respond to data quality issues promptly.
Alerts and Notifications
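One way to wire up notifications is an EventBridge rule that matches Glue Data Quality evaluation events and forwards them to an SNS topic. The sketch below is illustrative: the event source and detail-type reflect my understanding of the events Glue emits and should be checked against the current documentation, and the topic ARN is a placeholder.
import json
import boto3

events = boto3.client('events')

# Match Glue Data Quality evaluation-result events
pattern = {
    'source': ['aws.glue-dataquality'],
    'detail-type': ['Data Quality Evaluation Results Available'],
}

events.put_rule(
    Name='glue-dq-evaluation-results',
    EventPattern=json.dumps(pattern),
    State='ENABLED',
)

# Forward matching events to an SNS topic (placeholder ARN) for email or chat alerts
events.put_targets(
    Rule='glue-dq-evaluation-results',
    Targets=[{'Id': 'dq-alerts-sns', 'Arn': 'arn:aws:sns:us-east-1:123456789012:dq-alerts'}],
)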
Dashboards and Reporting
To effectively utilize AWS Glue Data Quality, it’s important to understand its core components:
Data Quality Definition Language (DQDL)
Purpose:
Features:
Example Ruleset in DQDL: The following rules were generated by AWS Glue Data Quality for a Glue table used in this blog. The CSV file, generated using the Python Faker library, was tweaked to deliberately introduce data quality issues.
The code to generate the CSV file with DQ issues is shown below. You can see that about 5% of the data, amounting to roughly 50 problematic rows, was generated with issues; you can introduce whatever DQ issues you see fit.
import pandas as pd
import numpy as np
from faker import Faker
import random
import string
# Initialize Faker
fake = Faker()
# Function to generate alphanumeric strings
def generate_alphanumeric(length):
    return ''.join(random.choices(string.ascii_letters + string.digits, k=length))
# Number of rows in the dataset
num_rows = 1000
# Set random seed for reproducibility (optional)
random.seed(42)
np.random.seed(42)
# Generate the data
data = {
    'ID': range(1, num_rows + 1),
    'Name': [fake.name() for _ in range(num_rows)],
    'Age': [random.randint(18, 80) for _ in range(num_rows)],
    'Email': [fake.email() for _ in range(num_rows)],
    'City': [fake.city() for _ in range(num_rows)],
    'Salary': [round(random.uniform(30000, 150000), 2) for _ in range(num_rows)],
    'Department': [fake.job() for _ in range(num_rows)],
    'EmployeeCode': [generate_alphanumeric(8) for _ in range(num_rows)],
    'JoinDate': [fake.date_between(start_date='-5y', end_date='today') for _ in range(num_rows)],
    'Rating': [round(random.uniform(1, 5), 1) for _ in range(num_rows)]
}
# Create a DataFrame
df = pd.DataFrame(data)
# Introduce data quality issues
# 1. Duplicate rows (1% of the data)
num_duplicates = int(num_rows * 0.01)
duplicate_indices = random.sample(range(num_rows), num_duplicates)
df = pd.concat([df, df.iloc[duplicate_indices]], ignore_index=True)
# 2. Missing values (5% of the data for selected columns)
for column in ['Age', 'Email', 'Salary', 'Department']:
    mask = np.random.rand(len(df)) < 0.05
    df.loc[mask, column] = np.nan
# 3. Outliers
# Age outliers (age > 99)
outlier_indices = random.sample(range(len(df)), int(len(df) * 0.01)) # 1% of data
df.loc[outlier_indices, 'Age'] = [random.randint(100, 120) for _ in range(len(outlier_indices))]
# Salary outliers (very high salaries)
salary_outlier_indices = random.sample(range(len(df)), int(len(df) * 0.01)) # 1% of data
df.loc[salary_outlier_indices, 'Salary'] = [
    round(random.uniform(500000, 1000000), 2) for _ in range(len(salary_outlier_indices))
]
# 4. Inconsistent data formats
# Change 'Age' column dtype to object to allow strings
df['Age'] = df['Age'].astype('object')
# Some ages as strings
age_format_indices = random.sample(range(len(df)), int(len(df) * 0.01)) # 1% of data
df.loc[age_format_indices, 'Age'] = df.loc[age_format_indices, 'Age'].apply(
    lambda x: str(x) if not pd.isna(x) else x
)
# Some emails without domain
email_format_indices = random.sample(range(len(df)), int(len(df) * 0.01)) # 1% of data
df.loc[email_format_indices, 'Email'] = df.loc[email_format_indices, 'Email'].apply(
    lambda x: x.split('@')[0] if isinstance(x, str) else x
)
# Display the first few rows and data info
print(df.head(10))
print("\nDataset Info:")
df.info()
# Display summary statistics
print("\nSummary Statistics:")
print(df.describe(include='all'))
# Save to CSV
df.to_csv('c:/temp/synthetic_dataset_with_issues.csv', index=False)
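As a quick sanity check on the generated file, a few pandas calls confirm the injected issues before the data is cataloged; this is just a sketch using the same output path as above.
import pandas as pd

# Re-read the generated file and verify the injected data quality issues
check = pd.read_csv('c:/temp/synthetic_dataset_with_issues.csv')

print('Duplicate rows:', check.duplicated().sum())
print('Missing values per column:')
print(check.isna().sum())
print('Ages above 99:', (pd.to_numeric(check['Age'], errors='coerce') > 99).sum())
print('Salaries above 500k:', (check['Salary'] > 500000).sum())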
The generated CSV file with 1,000 rows looks like this (sample row 1), and the rules that were applied are shown right below the CSV sample.
ID,Name,Age,Email,City,Salary,Department,EmployeeCode,JoinDate,Rating
1,Anna Gonzalez,58.0,[email protected],North Ashley,85403.81,Exhibition designer,WwxJLQ6x,2023-05-31,3.3
# Example rules: Completeness "colA" between 0.4 and 0.8, ColumnCount > 10
# Table schema is ID, Name, Age, Email, City, Salary, Department, EmployeeCode, JoinDate, Rating
# Rules applied at column level
Rules = [
RowCount between 505 and 2020,
IsComplete "id",
StandardDeviation "id" between 274.51 and 303.41,
Uniqueness "id" > 0.95,
ColumnValues "id" <= 1000,
IsComplete "name",
Uniqueness "name" > 0.95,
ColumnLength "name" between 6 and 24,
Completeness "age" >= 0.93,
StandardDeviation "age" between 18.41 and 20.35,
ColumnValues "age" between 17 and 121,
IsComplete "email",
Uniqueness "email" > 0.95,
ColumnLength "email" <= 32,
IsComplete "city",
ColumnLength "city" between 5 and 23,
Completeness "salary" >= 0.93,
StandardDeviation "salary" between 73795.65 and 81563.61,
ColumnValues "salary" between 30320.09 and 970178.6,
IsComplete "department",
ColumnLength "department" <= 49,
IsComplete "employeecode",
Uniqueness "employeecode" > 0.95,
ColumnLength "employeecode" = 8,
IsComplete "joindate",
ColumnLength "joindate" = 10,
IsComplete "rating",
StandardDeviation "rating" between 1.09 and 1.21,
ColumnValues "rating" <= 5
]
Quality Results
The ruleset above is evaluated against the dataset, and you can see the results on the console. You can also download a copy of the results in JSON format.
Outcome Reports:
Contents:
The report can be downloaded from the console or obtained programmatically using the AWS SDK. Typically the report has the following format, starting with the DQ run ID and metadata of the Glue table. Each rule applied to the dataset is listed by its name, description, evaluation result, and the evaluated metric. The report is very comprehensive and can be used in a Jupyter notebook with seaborn or other Python libraries to create nice graphical views.
{
"ResultId": "dqresult-303093db363f2d2a04a9419f3272c7871438907a",
"Score": 0.9,
"DataSource": {
"GlueTable": {
"DatabaseName": "dqblogdb",
"TableName": "dqblogbucket",
"CatalogId": "873909134233"
}
},
"RulesetName": "dqblogruleset",
"StartedOn": "2024-09-28T13:10:30.811Z",
"CompletedOn": "2024-09-28T13:11:00.394Z",
"RulesetEvaluationRunId": "dqrun-625f2914a79e3b323e0335ce18ec349659f1414f",
"RuleResults": [
{
"Name": "Rule_1",
"Description": "RowCount between 505 and 2020",
"Result": "PASS",
"EvaluatedMetrics": {
"Dataset.*.RowCount": 1010
}
},
{
"Name": "Rule_2",
"Description": "IsComplete \"id\"",
"Result": "PASS",
"EvaluatedMetrics": {
"Column.id.Completeness": 1
}
},
Once I downloaded the report, I was quickly able to plot the failed rules and the number of failures that occurred for each, using my JupyterLab notebook and the Python seaborn library. This is a fairly trivial example to plot; however, more complex plots can be generated from larger DQ reports.
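For reference, a minimal sketch of that plot is below; it assumes the downloaded report has been saved locally as dq_results.json (a hypothetical filename) and has the structure shown above.
import json
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the downloaded DQ result (hypothetical local filename)
with open('dq_results.json') as f:
    report = json.load(f)

rules = pd.DataFrame(report['RuleResults'])
failed = rules[rules['Result'] != 'PASS']

# Bar chart of rule outcomes, annotated with the overall score
sns.countplot(data=rules, x='Result')
plt.title(f"Rule outcomes (overall score: {report['Score']})")
plt.tight_layout()
plt.show()

# List the rules that failed, for follow-up
print(failed[['Name', 'Description']])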
Integration with AWS Glue Data Catalog
It is also possible to use data quality rules with the AWS Glue Data Catalog to enforce rules on cataloged datasets. AWS Glue Data Quality allows you to evaluate and monitor data quality based on user-defined rules. Here's how you can implement this:
Create Data Quality Rules: You can define data quality rules using the Data Quality Definition Language (DQDL). These rules can check various characteristics of your datasets, such as completeness, uniqueness, data types, and more.
Apply Rules to Cataloged Data: Once you’ve created your rules, you can apply them to tables in your AWS Glue Data Catalog. This allows you to enforce data quality standards on your cataloged datasets.
Run Data Quality Evaluations: You can run data quality evaluations on your cataloged data using the defined rules. This can be done through the AWS Glue console, AWS Glue Studio, or programmatically using AWS Glue APIs.
Review Results: After running an evaluation, you can review the results to identify any data quality issues in your cataloged datasets.
Automate Data Quality Checks: You can integrate these data quality checks into your ETL pipelines or set up scheduled evaluations to continuously monitor the quality of your cataloged data.
Take Action on Results: Based on the evaluation results, you can set up actions to be taken when data quality issues are detected, such as alerting or triggering remediation workflows.
Integration with Amazon DataZone
You can use AWS Glue Data Quality metrics in Amazon DataZone to enhance data governance and provide valuable insights to data consumers. Here’s how you can leverage this integration:
Enable Data Quality for AWS Glue Data Source:
Automatic Import of Quality Metrics:
Viewing Quality Metrics:
Historical Data:
Unified Dashboard:
Third-Party Integration:
By using this integration, you can improve data trust, make more confident business decisions, and provide valuable context to data consumers within your organization. Remember to ensure you have the necessary permissions set up for both AWS Glue and Amazon DataZone to enable this integration.
Conclusions
In today’s data-driven world, the quality of your data is just as important as the quantity. Data Quality (DQ) serves as the foundation upon which reliable analytics, insightful business intelligence, and effective decision-making are built. Poor data quality can lead to misguided strategies, operational inefficiencies, and lost revenue opportunities.
Throughout this blog, we've explored the critical role of data quality and how AWS Glue Data Quality offers a comprehensive solution to address this vital aspect of data management. By leveraging AWS Glue Data Quality, you can define quality rules, apply them to your datasets, and continuously monitor the results.
By creating rule sets with DQDL, you gain the ability to express complex data quality rules in an organized and reusable manner. This not only streamlines the validation process but also ensures consistency and compliance across different datasets and projects.
Adopting AWS Glue Data Quality empowers your organization to take control of data integrity proactively. It allows you to automate quality checks, reduce manual errors, and maintain high standards as your data scales. With AWS Glue’s integration capabilities, you can seamlessly weave data quality into every stage of your data lifecycle.
Looking Ahead
Investing in data quality is not just a technical necessity; it's a strategic imperative. High-quality data enhances customer satisfaction, drives innovation, and provides a competitive advantage in the marketplace. By integrating AWS Glue Data Quality into your data management practices, you're laying a solid foundation for reliable analytics and informed decision-making.
As you continue on your data journey, remember that the pursuit of data quality is an ongoing process. AWS Glue Data Quality offers the tools and flexibility to adapt to evolving data landscapes and business needs. By prioritizing data quality today, you’re setting your organization up for success tomorrow. Explore the power of AWS Glue Data Quality to unlock the full potential of your data. With accurate, consistent, and trustworthy data at your fingertips, the possibilities are endless.