Change Data Capture (CDC) using AWS Glue in a Batch ETL Process
Kushwanth Chowdary Kandala
Senior Generative AI/ML Data Scientist | Network Copilot | Multimodel Ragger | AI/ML Engg Solutions
In AWS Glue, job bookmarks are a feature that helps track the state of data processing in your ETL (Extract, Transform, Load) jobs. Job bookmarks are not the same as CDC (Change Data Capture), but they serve a related purpose.
Here's a brief overview of both concepts:
- Job bookmarks: AWS Glue persists state about the data a job has already processed (for example, which files or partitions it has read), so subsequent runs pick up only new data instead of reprocessing everything.
- CDC: a pattern for identifying and capturing the rows that were inserted, updated, or deleted in a source system so those changes can be tracked, logged, and propagated downstream.
While job bookmarks can be used to implement a form of incremental processing and avoid unnecessary reprocessing of data, they do not inherently capture or track changes in data. If you need to implement a full CDC solution, you would typically use specific CDC tools or techniques, such as comparing timestamps, change flags, or versioning.
In summary, job bookmarks in AWS Glue are more about job state management and optimizing ETL job execution, while CDC is about capturing and managing changes to data. Depending on your use case, you might use job bookmarks alongside other techniques to achieve the desired outcome, but they are not a CDC solution in themselves.
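For reference, here is a minimal sketch of how job bookmarks are typically wired into a Glue job: bookmarks are enabled on the job (for example with the --job-bookmark-option job-bookmark-enable job parameter), each source gets a transformation_ctx so Glue can remember what it has already read, and job.commit() records the bookmark state. The database and table names are placeholders.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext
# Bookmarks are enabled on the job itself (--job-bookmark-option job-bookmark-enable);
# job.init()/job.commit() plus a transformation_ctx on each source make them effective.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# The transformation_ctx is the handle Glue uses to remember which data
# this source has already processed in previous runs.
source_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="your_database",
    table_name="your_source_table",
    transformation_ctx="source_bookmark",
)
# ... transform and write the data ...
# Committing the job persists the bookmark state for the next run.
job.commit()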
Change Data Capture (CDC) tools:
Change Data Capture (CDC) tools are designed to identify and capture changes made to data so that these changes can be tracked, logged, and replicated to other systems. Popular options include AWS Database Migration Service (DMS), Debezium, Oracle GoldenGate, and Qlik Replicate.
When choosing a CDC tool, consider factors such as the databases you are using, the scale of your data, the required latency, and your overall architecture. Additionally, the specific features, ease of use, and compatibility with your existing technology stack are important considerations. Always refer to the documentation of the respective tools for the most up-to-date information.
This article covers four simple ways to handle different cases of source data changes in a batch ETL process, depending on whether a unique key and a last-updated timestamp are available.
CASE 1: The source has a last-updated timestamp column ("your_timestamp_column") and a unique identifier ("your_key_column").
Implementing Change Data Capture (CDC) using AWS Glue involves comparing the current state of your data with a previous state to identify changes. Below is a simplified example using Python and PySpark in an AWS Glue ETL script. This script assumes you are comparing data in two tables, such as "source" and "destination," and identifies new or updated records.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql import functions as F
# Create a GlueContext and initialize the job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Define your source and destination tables
source_table = "your_source_table"
destination_table = "your_destination_table"
# Create DynamicFrames for source and destination data
source_dyf = glueContext.create_dynamic_frame.from_catalog(database="your_database", table_name=source_table)
destination_dyf = glueContext.create_dynamic_frame.from_catalog(database="your_database", table_name=destination_table)
# Convert to Spark DataFrames so the comparison can be expressed as column conditions
source_df = source_dyf.toDF().alias("src")
destination_df = destination_dyf.toDF().alias("dst")
# Left join source to destination on the unique key
joined_df = source_df.join(
    destination_df,
    F.col("src.your_key_column") == F.col("dst.your_key_column"),
    "left_outer",
)
# Identify changes: rows that are new (no match in the destination)
# or whose source timestamp is newer than the destination copy
changes_df = joined_df.filter(
    F.col("dst.your_key_column").isNull()
    | (F.col("src.your_timestamp_column") > F.col("dst.your_timestamp_column"))
)
# Keep only the source-side columns and convert back to a DynamicFrame
final_dyf = DynamicFrame.fromDF(changes_df.select("src.*"), glueContext, "final_dyf")
# Write the changes to the destination
glueContext.write_dynamic_frame.from_catalog(frame=final_dyf, database="your_database", table_name="your_cdc_table")
# Commit the changes
job.commit()
This is a simplified example; depending on your use case and data structure, you might need to adapt the script. For large datasets, consider more efficient and scalable options, such as reading only the partitions that can contain new or changed data.
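One common way to do that on a large, partitioned catalog table is to push a partition filter down into the read itself so the job only scans recent partitions. A minimal sketch, assuming the source table is partitioned by a date column; the partition name and value are placeholders.
# Only read partitions that can contain new or changed data,
# instead of scanning the whole table on every run.
source_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="your_database",
    table_name="your_source_table",
    push_down_predicate="partition_date >= '2024-01-01'",
)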
CASE 2: The source has a unique identifier ("your_key_column") but no last-updated timestamp column.
Implementing Change Data Capture (CDC) without a specific timestamp column in the source and destination tables can be more challenging, as the typical approach involves comparing the data based on some form of last modified timestamp or incremental key. Here's a simplified example using AWS Glue without relying on a timestamp column.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql import functions as F
# Create a GlueContext and initialize the job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Define your source and destination tables
source_table = "your_source_table"
destination_table = "your_destination_table"
# Create DynamicFrames for source and destination data
source_dyf = glueContext.create_dynamic_frame.from_catalog(database="your_database", table_name=source_table)
destination_dyf = glueContext.create_dynamic_frame.from_catalog(database="your_database", table_name=destination_table)
# Convert to Spark DataFrames and left join on the unique key
source_df = source_dyf.toDF().alias("src")
destination_df = destination_dyf.toDF().alias("dst")
joined_df = source_df.join(
    destination_df,
    F.col("src.your_key_column") == F.col("dst.your_key_column"),
    "left_outer",
)
# Without a timestamp, compare every non-key column to detect updates
non_key_cols = [c for c in source_df.columns if c != "your_key_column"]
diff_condition = F.lit(False)
for c in non_key_cols:
    # eqNullSafe treats NULL == NULL as equal, so only real differences are flagged
    diff_condition = diff_condition | ~F.col(f"src.{c}").eqNullSafe(F.col(f"dst.{c}"))
# Keep new records (no match in destination) and records whose values changed
changes_df = joined_df.filter(F.col("dst.your_key_column").isNull() | diff_condition)
# Keep only the source-side columns and convert back to a DynamicFrame
final_dyf = DynamicFrame.fromDF(changes_df.select("src.*"), glueContext, "final_dyf")
# Write the changes to the destination
glueContext.write_dynamic_frame.from_catalog(frame=final_dyf, database="your_database", table_name="your_cdc_table")
# Commit the changes
job.commit()
In this example, the source and destination are joined on the key column, and a record is treated as new when its key is missing from the destination, or as updated when any of its non-key columns differ.
Keep in mind that without a specific timestamp, detecting updates relies on changes in the data itself. If your source and destination tables are large, consider using more efficient techniques or tools for CDC, and always test the solution thoroughly to ensure it meets your requirements.
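If comparing every column through a join becomes unwieldy, a more compact (though still full-scan) technique is a set difference between the two tables: rows present in the source but not in the destination are, by definition, new or changed. A minimal sketch, assuming both tables share the same columns and the frames from the example above.
# Rows that exist in the source but not (value-for-value) in the destination
# are new or updated; exceptAll keeps duplicate rows, unlike subtract.
source_df = source_dyf.toDF()
destination_df = destination_dyf.toDF()
changed_df = source_df.exceptAll(destination_df.select(*source_df.columns))
changed_dyf = DynamicFrame.fromDF(changed_df, glueContext, "changed_dyf")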
The resolveChoice transformation in AWS Glue is used to handle choices made during dynamic frame casting or resolving column types. It allows you to handle cases where column types are ambiguous or need to be explicitly defined.
In the CASE 3 example below, resolveChoice is used to cast the column "your_key_column" to a long type before the frames are compared. Here's the relevant part of the code:
source_dyf = source_dyf.resolveChoice(
    specs=[("your_key_column", "cast:long")]
)
Here's what the parameters of resolveChoice do: specs is a list of (column, action) tuples, where the action (here cast:long) tells Glue how to resolve that column's type; the separate choice parameter instead applies a single resolution action to every ambiguous column and cannot be combined with specs.
In simpler terms, the resolveChoice transformation is telling AWS Glue to cast "your_key_column" in the source_dyf dynamic frame to a long type. This can be necessary when dealing with mixed types or when a column's type needs to be explicitly defined for further processing.
This is just one use case of resolveChoice. Depending on your specific data and requirements, you might need to use it differently. Always refer to the AWS Glue documentation for the most accurate and detailed information on transformations and their parameters.
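For contrast, here is a small sketch of the frame-wide form, which my reading of the Glue documentation suggests resolves every ambiguous column with one action rather than targeting a single column.
# Split every ambiguous column into one column per observed type,
# e.g. a column that is sometimes int and sometimes string
# becomes your_column_int and your_column_string.
resolved_dyf = changes_dyf.resolveChoice(choice="make_cols")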
CASE 3: The source has a unique identifier ("your_key_column"), no last-updated timestamp column, and changes are tracked with a change flag.
Implementing Change Data Capture (CDC) with upserts (inserts and updates) without a specific timestamp column in the source and destination tables is more complex. In this scenario, you might need to rely on some form of change tracking, such as a change flag or a versioning mechanism. Below is a simplified example using AWS Glue.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql import functions as F
# Create a GlueContext and initialize the job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Define your source and destination tables
source_table = "your_source_table"
destination_table = "your_destination_table"
# Create DynamicFrames for source and destination data
source_dyf = glueContext.create_dynamic_frame.from_catalog(database="your_database", table_name=source_table)
destination_dyf = glueContext.create_dynamic_frame.from_catalog(database="your_database", table_name=destination_table)
# Resolve any ambiguous type on the key column before comparing the frames
source_dyf = source_dyf.resolveChoice(specs=[("your_key_column", "cast:long")])
# Convert to Spark DataFrames and left join on the unique key
source_df = source_dyf.toDF().alias("src")
destination_df = destination_dyf.toDF().alias("dst")
joined_df = source_df.join(
    destination_df,
    F.col("src.your_key_column") == F.col("dst.your_key_column"),
    "left_outer",
)
# Identify changes: new keys, or rows whose non-key values differ from the destination
non_key_cols = [c for c in source_df.columns if c != "your_key_column"]
diff_condition = F.lit(False)
for c in non_key_cols:
    diff_condition = diff_condition | ~F.col(f"src.{c}").eqNullSafe(F.col(f"dst.{c}"))
changes_df = joined_df.filter(F.col("dst.your_key_column").isNull() | diff_condition)
# Add a change flag to indicate insert (I) or update (U) and keep only the source columns
upsert_df = changes_df.select(
    "src.*",
    F.when(F.col("dst.your_key_column").isNull(), F.lit("I")).otherwise(F.lit("U")).alias("change_flag"),
)
# Write the changes to the destination
upsert_dyf = DynamicFrame.fromDF(upsert_df, glueContext, "upsert_dyf")
glueContext.write_dynamic_frame.from_catalog(frame=upsert_dyf, database="your_database", table_name="your_cdc_table")
# Commit the changes
job.commit()
In this example, each changed record carries a change_flag of "I" for a new key or "U" for an existing key whose values changed, so the downstream load can decide whether to insert or update the row.
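Downstream, the change flag lets the load step treat inserts and updates differently. A minimal sketch of how the flagged frame could be split, reusing upsert_df from above; the target handling is illustrative, since plain S3-backed catalog tables are append-only and true updates need a target such as a JDBC database or a transactional table format.
# Split the changes by flag so each group can be loaded appropriately.
inserts_df = upsert_df.filter(F.col("change_flag") == "I").drop("change_flag")
updates_df = upsert_df.filter(F.col("change_flag") == "U").drop("change_flag")
# New rows can simply be appended to the target.
inserts_dyf = DynamicFrame.fromDF(inserts_df, glueContext, "inserts_dyf")
glueContext.write_dynamic_frame.from_catalog(
    frame=inserts_dyf, database="your_database", table_name="your_cdc_table")
# Updated rows must be applied by a target that supports updates
# (for example via JDBC or a table format with merge/upsert support).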
CASE 4: The source has neither a unique key nor a last-updated timestamp column.
Handling Change Data Capture (CDC) and upserts without a timestamp column or a unique identifier (key) in the source and destination tables can be challenging. In such cases, you might need to rely on other mechanisms, such as hashing the entire record or using a combination of fields to identify changes. Keep in mind that these methods might not be as reliable as having a dedicated timestamp column or a unique key.
Here's a simplified example using AWS Glue where we use a hash of the entire record to identify changes.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql import functions as F
# Create a GlueContext and initialize the job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Define your source and destination tables
source_table = "your_source_table"
destination_table = "your_destination_table"
# Create DynamicFrames for source and destination data
source_dyf = glueContext.create_dynamic_frame.from_catalog(database="your_database", table_name=source_table)
destination_dyf = glueContext.create_dynamic_frame.from_catalog(database="your_database", table_name=destination_table)
source_df = source_dyf.toDF()
destination_df = destination_dyf.toDF()
# Hash the entire record so the hash can act as a surrogate identity for each row
# (assumes the source and destination tables share the same columns)
def with_record_hash(df):
    cols = [F.coalesce(F.col(c).cast("string"), F.lit("")) for c in sorted(df.columns)]
    return df.withColumn("record_hash", F.sha2(F.concat_ws("||", *cols), 256))
source_hashed = with_record_hash(source_df)
destination_hashed = with_record_hash(destination_df)
# Rows whose hash does not appear in the destination are new or changed
changes_df = source_hashed.join(
    destination_hashed.select("record_hash"), on="record_hash", how="left_anti"
)
# Drop the helper hash column before writing
final_df = changes_df.drop("record_hash")
# Write the changes to the destination
final_dyf = DynamicFrame.fromDF(final_df, glueContext, "final_dyf")
glueContext.write_dynamic_frame.from_catalog(frame=final_dyf, database="your_database", table_name="your_cdc_table")
# Commit the changes
job.commit()
In this example, each record is reduced to a SHA-256 hash of all of its columns, and a source row is treated as new or changed when its hash does not exist in the destination.
Keep in mind that this is a workaround and might not be suitable for all scenarios. If possible, consider modifying your data model to include a timestamp or a unique identifier to make change tracking more robust. Always thoroughly test any CDC implementation to ensure it meets your requirements.
Conclusion:
As the last case shows, the ETL process becomes considerably more complicated when there is no unique key and no last-updated timestamp.
Without a timestamp, detecting updates relies on comparing the data itself, and when the source and destination tables are large, consider more efficient techniques or dedicated CDC tools.