Data Lake Demo using Change Data Capture (CDC) on AWS – Part 2: CDC with Amazon MSK
In the previous post, we discussed a data lake solution where data ingestion is performed using change data capture (CDC) and the output files are upserted to an Apache Hudi table. As the table is registered in the Glue Data Catalog, it can be used for ad-hoc queries and report/dashboard creation. The Northwind database is used as the source database and, following the transactional outbox pattern, order-related changes are upserted to an outbox table by triggers. The data ingestion is developed using Kafka connectors on the local Confluent platform, where Debezium for PostgreSQL is used as the source connector and the Lenses S3 sink connector is used as the sink connector. We confirmed that order creation and update events are captured as expected, and the solution is ready for production deployment. In this post, we'll build the CDC part of the solution on AWS using Amazon MSK and MSK Connect.
Architecture
As described in a Red Hat IT topics article, change data capture (CDC) is a proven data integration pattern to track when and what changes occur in data and then alert other systems and services that must respond to those changes. Change data capture helps maintain consistency and functionality across all systems that rely on data.
The primary use of CDC is to enable applications to respond almost immediately whenever data in databases changes. Specifically, its use cases cover microservices integration, data replication with up-to-date data, building time-sensitive analytics dashboards, auditing and compliance, cache invalidation, full-text search and so on. There are a number of approaches to CDC – polling, dual writes and log-based CDC. Among those, log-based CDC has advantages over the other approaches.
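To make the contrast concrete, below is a minimal polling-based CDC sketch in Python using psycopg2. The table and `updated_at` column are hypothetical and purely for illustration; the point is that rows changed and reverted, or deleted, between polls are never observed, whereas log-based CDC reads every change from the database's write-ahead log.

```python
# A simple polling-based CDC loop (illustrative only - the orders table and
# updated_at column are hypothetical). Changes that are reverted or rows that
# are deleted between polls are never observed, which is the main weakness
# that log-based CDC avoids by reading the write-ahead log.
import time

import psycopg2

conn = psycopg2.connect(
    host="localhost", dbname="northwind", user="postgres", password="postgres"
)
last_seen = "1970-01-01"

while True:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT order_id, updated_at FROM orders "
            "WHERE updated_at > %s ORDER BY updated_at",
            (last_seen,),
        )
        for order_id, updated_at in cur.fetchall():
            print(f"change detected: order {order_id} at {updated_at}")
            last_seen = updated_at
    time.sleep(5)  # poll every 5 seconds
```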
Both Amazon DMS and Debezium implement log-based CDC. While the former is a managed service, the latter can be deployed to a Kafka cluster as a (source) connector. Debezium uses Apache Kafka as a messaging service to deliver database change notifications to the applicable systems and applications. Note that Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other data systems using connectors. In AWS, we can use Amazon MSK and MSK Connect to build a Debezium-based CDC solution.
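As a preview of what this looks like, the sketch below registers a Debezium for PostgreSQL source connector with MSK Connect through the boto3 `kafkaconnect` client. All ARNs, endpoints, credentials and the `public.outbox` table name are placeholders rather than the values used in this series, and the configuration assumes a Debezium 1.x custom plugin has already been uploaded to MSK Connect.

```python
# A minimal sketch of creating a Debezium PostgreSQL source connector with
# MSK Connect via boto3. Every value in angle brackets is a placeholder.
import boto3

client = boto3.client("kafkaconnect")

# Debezium 1.x source connector properties (table name is hypothetical).
debezium_config = {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "plugin.name": "pgoutput",
    "database.hostname": "<postgres-endpoint>",
    "database.port": "5432",
    "database.user": "<db-user>",
    "database.password": "<db-password>",
    "database.dbname": "northwind",
    "database.server.name": "northwind",
    "table.include.list": "public.outbox",  # outbox table populated by triggers
}

client.create_connector(
    connectorName="northwind-source-connector",
    kafkaConnectVersion="2.7.1",
    connectorConfiguration=debezium_config,
    capacity={"provisionedCapacity": {"mcuCount": 1, "workerCount": 1}},
    kafkaCluster={
        "apacheKafkaCluster": {
            "bootstrapServers": "<msk-bootstrap-servers>",
            "vpc": {"securityGroups": ["<sg-id>"], "subnets": ["<subnet-id>"]},
        }
    },
    kafkaClusterClientAuthentication={"authenticationType": "NONE"},
    kafkaClusterEncryptionInTransit={"encryptionType": "PLAINTEXT"},
    plugins=[{"customPlugin": {"customPluginArn": "<debezium-plugin-arn>", "revision": 1}}],
    serviceExecutionRoleArn="<execution-role-arn>",
)
```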
Data replication to data lakes using CDC can be much more effective if data is stored in a format that supports atomic transactions and consistent updates. Popular choices are Apache Hudi, Apache Iceberg and Delta Lake. Among those, Apache Hudi can be a good option as it is well-integrated with AWS services.
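To illustrate why this matters for CDC output, below is a minimal PySpark sketch of a Hudi upsert to S3. The table name, key fields and bucket are hypothetical, not the ones used later in the series, and it assumes the Hudi Spark bundle is available on the classpath (e.g. supplied via `--packages` or `--jars`).

```python
# A minimal PySpark sketch of a Hudi upsert to S3 - table name, key fields and
# bucket are hypothetical. Records sharing the same record key are updated in
# place rather than appended, which is what CDC-driven replication relies on.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hudi-upsert-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

df = spark.createDataFrame(
    [(10248, "shipped", "2021-12-01 10:00:00")],
    ["order_id", "status", "updated_at"],
)

# Subsequent writes with the same order_id update the existing record.
df.write.format("hudi").options(**hudi_options).mode("append").save("s3://my-bucket/orders/")
```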
The diagram below shows the architecture of the data lake solution that we will be building in this series of posts.
In the remainder of this post, we'll build the CDC part of the solution using Amazon MSK and MSK Connect.