AWS Glue Data Quality: From Raw to Refined

Introduction

What is Data Quality?

Almost every enterprise today is data driven, using data analytics, machine learning, and generative AI to achieve more customer-centric business alignment. With the use of data so pervasive, the obvious question enterprises ask themselves is, ‘How good is my data, and how do we ensure its quality so we can trust it?’

Data quality refers to the state of data or datasets measured against metrics such as accuracy, completeness, consistency, and overall reliability, which help enterprises make sound business decisions. Following the GIGO (Garbage In, Garbage Out) maxim, if you use bad data, you get bad results. High-quality data is essential for making informed decisions, managing business efficiency, and ensuring the success of data-driven initiatives.

From my own experience, having worked on several big data engineering and analytics projects, I can vouch for how data quality issues can affect overall business efficiency.

Why is Data Quality Important?

Quality data leads to more accurate insights, better-informed decisions, and better data-driven outcomes.

Clean, organized data reduces time spent on data cleansing and reconciliation. This is especially true for enterprises that generate high volumes of data from many different sources.

Preventing errors caused by poor data quality can save significant resources and avoid bad business decisions.

Accurate customer data improves service levels and communication.

Many industries require data to be maintained to certain quality standards, not just for internal usage but also as demanded by external auditors and regulatory agencies.

Key Factors in Assessing Data Quality

Accuracy: Does the data correctly represent the values it’s supposed to?

Completeness: Are all necessary data points present, without missing values or values that do not make sense?

Consistency: Is the data consistent across different datasets or systems?

Timeliness: Is the data up to date and representative of the current state? This applies to all data sources used to derive transformed datasets through ETL and to load the data warehouse.

Validity: Does the data conform to the defined format, type, and range?

Uniqueness: Are there duplicate records that could skew analysis?

Common Data Quality Issues

Let’s look at some key issues that apply to data quality assessments. Each of the following can be treated as a metric for measuring data quality. For example, if I have a dataset with 1,000 rows and 10 columns, you will typically assess the following and produce a data quality report on how many rows and columns are reliable. This is very important to understand: a row may appear complete, but bad data in one of its fields affects the integrity of the whole row. Data quality analysis evaluates differently when applied across rows versus at the column level. We will explore this aspect a little more in the next sections.

Duplicate Rows: Multiple identical or nearly identical records.

Missing Values: Absence of data in required fields, or NULL values.

Inconsistent Formats: Varying date formats or units of measurement within the same column, or alphanumeric strings in double/floating-point fields.

Outliers: Data points that differ significantly from the rest, causing wide statistical variance. An outlier is typically an observation that lies at an abnormal distance from other values in the same column of a dataset.

Data Decay: Outdated information that no longer reflects reality. This can make analytical outcomes unreliable for data-driven decisions.

Invalid Data: Values that don’t meet predefined criteria or business rules.

By addressing these factors and common issues, organizations can significantly improve their data quality, leading to more reliable analytics, improved operational efficiency, and better decision-making capabilities.

AWS Glue Data Quality

What is AWS Glue Data Quality?

AWS Glue Data Quality, a feature of the AWS Glue managed service, provides a comprehensive serverless framework to define, apply, and monitor data quality rules, integrating seamlessly with your data workflows. Here’s how it operates:

Rule Definition

At the heart of AWS Glue Data Quality is the Data Quality Definition Language (DQDL), a domain-specific language designed to express data quality rules in a clear and concise manner.

DQDL Overview:

  • Declarative Syntax: DQDL uses a SQL-like syntax, making it accessible for those familiar with SQL.
  • Flexible Rule Definitions: Allows you to specify a wide range of expectations about your data, including allowed values, data formats, patterns, ranges, and statistical properties.
  • Custom Logic: Supports complex conditions and aggregations to cater to intricate data quality requirements.

Rule Application

Once you’ve defined your data quality rules using DQDL, you can apply them directly to your datasets within AWS Glue.

Integration with Data Catalog:

  • Table-Level Application: Apply rules to specific tables or partitions cataloged in the AWS Glue Data Catalog.
  • Dynamic Application: Rules can be applied during ETL jobs and data ingestion processes, or run as standalone evaluations against Data Catalog tables (a minimal sketch follows this list).
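
As a quick illustration, the following is a minimal sketch of registering a DQDL ruleset against a cataloged table with the boto3 Glue client. The ruleset is a small subset of the rules used later in this blog, and the database, table, and ruleset names are placeholders; check the AWS Glue API reference for the full set of parameters.

import boto3

glue = boto3.client("glue")

# A small DQDL ruleset expressed as a string (subset of the rules shown later in this blog)
dqdl_ruleset = """
Rules = [
    RowCount between 505 and 2020,
    IsComplete "id",
    Uniqueness "id" > 0.95,
    ColumnValues "age" between 17 and 121
]
"""

# Create the ruleset and attach it to a table in the Glue Data Catalog
# (names below are placeholders for the blog's example database and table)
response = glue.create_data_quality_ruleset(
    Name="dqblogruleset",
    Description="Example ruleset for the synthetic employee dataset",
    Ruleset=dqdl_ruleset,
    TargetTable={
        "DatabaseName": "dqblogdb",
        "TableName": "dqblogbucket",
    },
)
print(response["Name"])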

Execution of Rules

  • Automated Evaluation: AWS Glue runs the defined rules against the specified data, checking each record or field as per the rule definitions.
  • Scalability: Leverages AWS Glue’s serverless architecture to scale evaluations based on data size.

Metrics Generation

After the rules are applied, AWS Glue Data Quality generates detailed metrics that provide insights into the health of your data.

Detailed Reports:

  • Pass/Fail Results: For each rule, see how many records passed or failed the validation.
  • Quality Scores: Aggregate scores that represent the overall data quality, helping you quantify data reliability.

Data Quality Dimensions:

  • Completeness, Accuracy, Consistency: Metrics cover various dimensions of data quality, offering a comprehensive view.

Integration with AWS Glue Workflows

Data quality checks are not standalone processes; they can be seamlessly integrated into existing AWS Glue workflows.

ETL Jobs:

  • In-Job Evaluations: Incorporate data quality evaluations within your ETL jobs to ensure data integrity before loading or transforming data (see the sketch after this list).
  • Conditional Actions: Set up workflows that take different paths based on data quality results (e.g., alerting, halting the pipeline, or initiating data remediation processes).
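
As a sketch of what an in-job evaluation can look like, the snippet below applies a small DQDL ruleset to a DynamicFrame inside a Glue PySpark job using the EvaluateDataQuality transform. The ruleset, the S3 path, and the evaluation context name are illustrative placeholders, and the publishing options should be checked against the current AWS Glue documentation.

from awsglue.context import GlueContext
from awsgluedq.transforms import EvaluateDataQuality
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the dataset to be validated (placeholder S3 path)
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://dq-blog-bucket/input/"]},
    format="csv",
    format_options={"withHeader": True},
)

# A small DQDL ruleset evaluated inside the job
ruleset = """
Rules = [
    RowCount > 0,
    IsComplete "id",
    ColumnValues "rating" <= 5
]
"""

# Evaluate the rules; the result is a DynamicFrame of per-rule outcomes
dq_results = EvaluateDataQuality.apply(
    frame=dyf,
    ruleset=ruleset,
    publishing_options={
        "dataQualityEvaluationContext": "dq_blog_check",
        "enableDataQualityCloudWatchMetrics": True,
        "enableDataQualityResultsPublishing": True,
    },
)
dq_results.toDF().show(truncate=False)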

Automation:

  • Scheduled Evaluations: Set up regular data quality checks to monitor data over time.
  • Event-Driven Triggers: Initiate data quality evaluations in response to specific events, such as data arrival, using Amazon EventBridge rules and AWS Lambda triggers (a sketch follows this list).
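
For example, an S3 data-arrival event routed through EventBridge could invoke a Lambda function like the sketch below, which starts a standalone evaluation run for the blog's ruleset. The role ARN, database, table, and ruleset names are placeholders.

import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    """Start a Glue Data Quality evaluation run when new data arrives."""
    run = glue.start_data_quality_ruleset_evaluation_run(
        DataSource={
            "GlueTable": {
                "DatabaseName": "dqblogdb",   # placeholder database
                "TableName": "dqblogbucket",  # placeholder table
            }
        },
        Role="arn:aws:iam::123456789012:role/GlueDataQualityRole",  # placeholder role
        RulesetNames=["dqblogruleset"],
    )
    # The run id can be used later to fetch results
    return {"RunId": run["RunId"]}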

Monitoring and Alerting

Maintaining high data quality is an ongoing process, and AWS Glue Data Quality provides tools to monitor and respond to data quality issues promptly.

Alerts and Notifications

  • Custom Thresholds: Define thresholds for data quality scores or specific rule failures to trigger alerts.
  • Integration with AWS Services: Use Amazon CloudWatch and Amazon SNS (Simple Notification Service) to receive notifications via email, SMS, or other channels (a sketch follows this list).
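
As an illustration, the snippet below fetches a completed evaluation result and publishes an SNS notification if the overall score falls below a threshold. The result ID, topic ARN, and the 0.95 threshold are assumptions for this example.

import boto3

glue = boto3.client("glue")
sns = boto3.client("sns")

SCORE_THRESHOLD = 0.95  # assumed threshold for this example

# Fetch a completed data quality result (placeholder result id)
result = glue.get_data_quality_result(
    ResultId="dqresult-303093db363f2d2a04a9419f3272c7871438907a"
)

failed_rules = [r["Name"] for r in result["RuleResults"] if r["Result"] == "FAIL"]

if result["Score"] < SCORE_THRESHOLD or failed_rules:
    sns.publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:dq-alerts",  # placeholder topic
        Subject="Data quality alert",
        Message=f"DQ score {result['Score']} below {SCORE_THRESHOLD}; failed rules: {failed_rules}",
    )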

Dashboards and Reporting

  • Visualization: Integrate with Amazon QuickSight or other BI tools to visualize data quality trends over time.
  • Stakeholder Communication: Share reports with data engineers, analysts, and business users to foster a data-driven culture.

Key Components of AWS Glue Data Quality

To effectively utilize AWS Glue Data Quality, it’s important to understand its core components:

Data Quality Definition Language (DQDL)

Purpose:

  • A specialized declarative language for defining data quality rules in an easy-to-understand format. DQDL is fairly simple and has very little learning curve.

Features:

  • Expressiveness: Allows for the creation of simple to complex rules, including conditional logic and aggregate functions.
  • Reusability: Write rules once and apply them across multiple datasets or environments.

Example Ruleset in DQDL: The following rules were generated by AWS Glue Data Quality for a Glue table used for this blog. The CSV file, generated using the Python Faker library, was tweaked to deliberately introduce data quality issues.

The code to generate the CSV file with DQ issues is below. You can see that about 5% of the data, amounting to roughly 50 problematic rows, was generated. You can introduce whatever DQ issues you see fit.

import pandas as pd
import numpy as np
from faker import Faker
import random
import string

# Initialize Faker
fake = Faker()

# Function to generate alphanumeric strings
def generate_alphanumeric(length):
    return ''.join(random.choices(string.ascii_letters + string.digits, k=length))

# Number of rows in the dataset
num_rows = 1000

# Set random seed for reproducibility (optional)
random.seed(42)
np.random.seed(42)

# Generate the data
data = {
    'ID': range(1, num_rows + 1),
    'Name': [fake.name() for _ in range(num_rows)],
    'Age': [random.randint(18, 80) for _ in range(num_rows)],
    'Email': [fake.email() for _ in range(num_rows)],
    'City': [fake.city() for _ in range(num_rows)],
    'Salary': [round(random.uniform(30000, 150000), 2) for _ in range(num_rows)],
    'Department': [fake.job() for _ in range(num_rows)],
    'EmployeeCode': [generate_alphanumeric(8) for _ in range(num_rows)],
    'JoinDate': [fake.date_between(start_date='-5y', end_date='today') for _ in range(num_rows)],
    'Rating': [round(random.uniform(1, 5), 1) for _ in range(num_rows)]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Introduce data quality issues

# 1. Duplicate rows (1% of the data)
num_duplicates = int(num_rows * 0.01)
duplicate_indices = random.sample(range(num_rows), num_duplicates)
df = pd.concat([df, df.iloc[duplicate_indices]], ignore_index=True)

# 2. Missing values (5% of the data for selected columns)
for column in ['Age', 'Email', 'Salary', 'Department']:
    mask = np.random.rand(len(df)) < 0.05
    df.loc[mask, column] = np.nan

# 3. Outliers
# Age outliers (age > 99)
outlier_indices = random.sample(range(len(df)), int(len(df) * 0.01))  # 1% of data
df.loc[outlier_indices, 'Age'] = [random.randint(100, 120) for _ in range(len(outlier_indices))]

# Salary outliers (very high salaries)
salary_outlier_indices = random.sample(range(len(df)), int(len(df) * 0.01))  # 1% of data
df.loc[salary_outlier_indices, 'Salary'] = [
    round(random.uniform(500000, 1000000), 2) for _ in range(len(salary_outlier_indices))
]

# 4. Inconsistent data formats
# Change 'Age' column dtype to object to allow strings
df['Age'] = df['Age'].astype('object')

# Some ages as strings
age_format_indices = random.sample(range(len(df)), int(len(df) * 0.01))  # 1% of data
df.loc[age_format_indices, 'Age'] = df.loc[age_format_indices, 'Age'].apply(
    lambda x: str(x) if not pd.isna(x) else x
)

# Some emails without domain
email_format_indices = random.sample(range(len(df)), int(len(df) * 0.01))  # 1% of data
df.loc[email_format_indices, 'Email'] = df.loc[email_format_indices, 'Email'].apply(
    lambda x: x.split('@')[0] if isinstance(x, str) else x
)

# Display the first few rows and data info
print(df.head(10))
print("\nDataset Info:")
df.info()

# Display summary statistics
print("\nSummary Statistics:")
print(df.describe(include='all'))

# Save to CSV
df.to_csv('c:/temp/synthetic_dataset_with_issues.csv', index=False)         

The generated CSV file with 1,000 rows looks like this (sample row 1), and the rules that were applied are shown right below the CSV sample.

ID,Name,Age,Email,City,Salary,Department,EmployeeCode,JoinDate,Rating
1,Anna Gonzalez,58.0,[email protected],North Ashley,85403.81,Exhibition designer,WwxJLQ6x,2023-05-31,3.3        
# Example rules: Completeness "colA" between 0.4 and 0.8, ColumnCount > 10
# Table schema is ID, Name, Age, Email, City, Salary, Department, EmployeeCode, JoinDate, Rating
# Rules applied at column level
Rules = [
    RowCount between 505 and 2020,
    IsComplete "id",
    StandardDeviation "id" between 274.51 and 303.41,
    Uniqueness "id" > 0.95,
    ColumnValues "id" <= 1000,
    IsComplete "name",
    Uniqueness "name" > 0.95,
    ColumnLength "name" between 6 and 24,
    Completeness "age" >= 0.93,
    StandardDeviation "age" between 18.41 and 20.35,
    ColumnValues "age" between 17 and 121,
    IsComplete "email",
    Uniqueness "email" > 0.95,
    ColumnLength "email" <= 32,
    IsComplete "city",
    ColumnLength "city" between 5 and 23,
    Completeness "salary" >= 0.93,
    StandardDeviation "salary" between 73795.65 and 81563.61,
    ColumnValues "salary" between 30320.09 and 970178.6,
    IsComplete "department",
    ColumnLength "department" <= 49,
    IsComplete "employeecode",
    Uniqueness "employeecode" > 0.95,
    ColumnLength "employeecode" = 8,
    IsComplete "joindate",
    ColumnLength "joindate" = 10,
    IsComplete "rating",
    StandardDeviation "rating" between 1.09 and 1.21,
    ColumnValues "rating" <= 5
]        

Quality Results

The ruleset above is evaluated against the dataset, and you can see the results on the console. You can also download a copy of the results in JSON format.

Outcome Reports:

  • Detailed insights into the data quality evaluation results.

Contents:

  • Rule-Level Metrics: Information on which rules passed or failed, along with counts of affected records.
  • Overall Scores: Aggregate data quality scores that summarize the health of the dataset.

Figure 1: Data Quality Metrics from AWS Console

The report can be downloaded from the console or obtained programmatically using the AWS SDK. The report typically has the following format, starting with the DQ run result ID and the metadata of the Glue table. Every rule that is applied to the dataset is listed by its name, description, result, and evaluated metric. The report is very comprehensive and can be used in a Jupyter notebook with seaborn or other Python libraries to create nice graphical views.

{
 "ResultId": "dqresult-303093db363f2d2a04a9419f3272c7871438907a",
 "Score": 0.9,
 "DataSource": {
  "GlueTable": {
   "DatabaseName": "dqblogdb",
   "TableName": "dqblogbucket",
   "CatalogId": "873909134233"
  }
 },
 "RulesetName": "dqblogruleset",
 "StartedOn": "2024-09-28T13:10:30.811Z",
 "CompletedOn": "2024-09-28T13:11:00.394Z",
 "RulesetEvaluationRunId": "dqrun-625f2914a79e3b323e0335ce18ec349659f1414f",
 "RuleResults": [
  {
   "Name": "Rule_1",
   "Description": "RowCount between 505 and 2020",
   "Result": "PASS",
   "EvaluatedMetrics": {
    "Dataset.*.RowCount": 1010
   }
  },
  {
   "Name": "Rule_2",
   "Description": "IsComplete \"id\"",
   "Result": "PASS",
   "EvaluatedMetrics": {
    "Column.id.Completeness": 1
   }
  },        

Once I downloaded the report, I was quickly able to plot the failed rules and the number of failures that occurred for each, using a JupyterLab notebook and the Python seaborn library (a minimal sketch is shown below). This is a fairly trivial example to plot; more complex plots can be generated from larger DQ reports.
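
The sketch below shows one way to do this, assuming the downloaded report has the JSON structure shown above (RuleResults entries with Name, Description, and Result fields); the file name is a placeholder.

import json

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the downloaded data quality report (placeholder file name)
with open("dq_report.json") as f:
    report = json.load(f)

# Flatten the per-rule results into a DataFrame
rules_df = pd.DataFrame(report["RuleResults"])

# Plot how many rules passed versus failed
ax = sns.countplot(data=rules_df, x="Result")
ax.set_title("Rule outcomes for the data quality run")
plt.tight_layout()
plt.show()

# List the rules that failed
print(rules_df.loc[rules_df["Result"] == "FAIL", ["Name", "Description"]])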

Figure 2: Failed Rules

Integration with AWS Glue Data Catalog

It is also possible to use data quality rules with the AWS Glue Data Catalog to enforce standards on cataloged datasets. AWS Glue Data Quality allows you to evaluate and monitor data quality based on user-defined rules. Here’s how you can implement this.

Create Data Quality Rules: You can define data quality rules using the Data Quality Definition Language (DQDL). These rules can check various characteristics of your datasets, such as completeness, uniqueness, data types, and more.

Apply Rules to Cataloged Data: Once you’ve created your rules, you can apply them to tables in your AWS Glue Data Catalog. This allows you to enforce data quality standards on your cataloged datasets.

Run Data Quality Evaluations: You can run data quality evaluations on your cataloged data using the defined rules. This can be done through the AWS Glue console, AWS Glue Studio, or programmatically using the AWS Glue APIs (a minimal sketch follows this paragraph).
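
Programmatically, a standalone evaluation against a cataloged table might look like the following sketch with boto3; the role ARN and the database, table, and ruleset names are placeholders, and the polling loop is deliberately simplified.

import time

import boto3

glue = boto3.client("glue")

# Start a standalone evaluation of an existing ruleset against a catalog table
run = glue.start_data_quality_ruleset_evaluation_run(
    DataSource={
        "GlueTable": {"DatabaseName": "dqblogdb", "TableName": "dqblogbucket"}
    },
    Role="arn:aws:iam::123456789012:role/GlueDataQualityRole",  # placeholder role
    RulesetNames=["dqblogruleset"],
)

# Simplified polling until the run finishes
while True:
    status = glue.get_data_quality_ruleset_evaluation_run(RunId=run["RunId"])
    if status["Status"] in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(15)

# Each run produces one or more result ids that can be fetched for review
for result_id in status.get("ResultIds", []):
    result = glue.get_data_quality_result(ResultId=result_id)
    print(result["Score"], [r["Result"] for r in result["RuleResults"]])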

Review Results: After running an evaluation, you can review the results to identify any data quality issues in your cataloged datasets.

Automate Data Quality Checks: You can integrate these data quality checks into your ETL pipelines or set up scheduled evaluations to continuously monitor the quality of your cataloged data.

Take Action on Results: Based on the evaluation results, you can set up actions to be taken when data quality issues are detected, such as alerting or triggering remediation workflows.

Integration with Amazon DataZone

You can use AWS Glue Data Quality metrics in Amazon DataZone to enhance data governance and provide valuable insights to data consumers. Here’s how you can leverage this integration:

Enable Data Quality for AWS Glue Data Source:

  • In the Amazon DataZone Data Portal, when creating or editing an AWS Glue data source, enable the data quality option.
  • Alternatively, use the Amazon DataZone APIs (CreateDataSource or UpdateDataSource) and set the autoImportDataQualityResult parameter to ‘True’ (a rough sketch follows this list). Please refer to the Amazon DataZone API documentation for more info.
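
A rough sketch of that API call with boto3 is shown below; the domain and data source identifiers are placeholders, and the exact nesting of autoImportDataQualityResult inside the Glue run configuration is an assumption to verify against the current Amazon DataZone API reference.

import boto3

datazone = boto3.client("datazone")

# Enable auto-import of AWS Glue Data Quality results on an existing data source.
# Identifiers are placeholders; the configuration shape is an assumption and
# should be checked against the Amazon DataZone API documentation.
response = datazone.update_data_source(
    domainIdentifier="dzd_exampledomain",
    identifier="ds_exampledatasource",
    configuration={
        "glueRunConfiguration": {
            "autoImportDataQualityResult": True,
            "relationalFilterConfigurations": [
                {"databaseName": "dqblogdb"}
            ],
        }
    },
)
print(response.get("status"))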

Automatic Import of Quality Metrics:

  • Once enabled, Amazon DataZone will automatically pull data quality metrics from AWS Glue.
  • These metrics will be imported on a schedule or can be run on-demand.

Viewing Quality Metrics:

  • Data consumers can easily search for data assets in the Amazon DataZone business data catalog.
  • They can view granular quality metrics and identify failed checks and rules for each asset.

Historical Data:

  • Amazon DataZone stores up to 30 data points per rule, allowing users to see how data quality metrics change over time.

Unified Dashboard:

  • Quality scores from AWS Glue Data Quality are unified into intuitive dashboards within Amazon DataZone.
  • This provides a comprehensive view of data quality across your organization.
  • As new quality metrics are added to an asset in AWS Glue, they are automatically reflected in Amazon DataZone without needing to republish the asset.

Third-Party Integration:

  • Amazon DataZone also offers APIs to ingest quality metrics from third-party data quality solutions, providing a unified view of data quality across different tools.

By using this integration, you can improve data trust, make more confident business decisions, and provide valuable context to data consumers within your organization. Remember to ensure you have the necessary permissions set up for both AWS Glue and Amazon DataZone to enable this integration.

Conclusions

In today’s data-driven world, the quality of your data is just as important as the quantity. Data Quality (DQ) serves as the foundation upon which reliable analytics, insightful business intelligence, and effective decision-making are built. Poor data quality can lead to misguided strategies, operational inefficiencies, and lost revenue opportunities.

Throughout this blog, we’ve explored the critical role of data quality and how AWS Glue Data Quality offers a comprehensive solution to address this vital aspect of data management. By leveraging AWS Glue Data Quality, you can:

  • Understand Data Quality: Recognize the importance of data accuracy, consistency, completeness, and reliability in driving business success.
  • Utilize AWS Glue Data Quality Features: Employ AWS Glue’s robust tools to define, apply, and monitor data quality rules within your data pipelines.
  • Create and Manage Rule Sets: Use the Data Quality Definition Language (DQDL) to craft precise and complex rule sets that encapsulate your data quality expectations.
  • Integrate Seamlessly: Incorporate data quality checks into your existing AWS Glue workflows and ETL processes without disrupting your operations. Integrate with AWS DataZone for data governance.
  • Generate Actionable Metrics: Obtain detailed insights and metrics on data quality, enabling proactive issue detection and resolution.

By creating rule sets with DQDL, you gain the ability to express complex data quality rules in an organized and reusable manner. This not only streamlines the validation process but also ensures consistency and compliance across different datasets and projects.

Adopting AWS Glue Data Quality empowers your organization to take control of data integrity proactively. It allows you to automate quality checks, reduce manual errors, and maintain high standards as your data scales. With AWS Glue’s integration capabilities, you can seamlessly weave data quality into every stage of your data lifecycle.

Looking Ahead

Investing in data quality is not just a technical necessity; it is a strategic imperative. High-quality data enhances customer satisfaction, drives innovation, and provides a competitive advantage in the marketplace. By integrating AWS Glue Data Quality into your data management practices, you’re laying a solid foundation for reliable analytics and informed decision-making.

As you continue on your data journey, remember that the pursuit of data quality is an ongoing process. AWS Glue Data Quality offers the tools and flexibility to adapt to evolving data landscapes and business needs. By prioritizing data quality today, you’re setting your organization up for success tomorrow. Explore the power of AWS Glue Data Quality to unlock the full potential of your data. With accurate, consistent, and trustworthy data at your fingertips, the possibilities are endless.

