Data Lake Demo using Change Data Capture (CDC) on AWS – Part 2: CDC with Amazon MSK
In the previous post, we discussed a data lake solution where data ingestion is performed using change data capture (CDC) and the output files are upserted to an Apache Hudi table. As the table is registered in the Glue Data Catalog, it can be used for ad-hoc queries and report/dashboard creation. The Northwind database is used as the source database and, following the transactional outbox pattern, order-related changes are upserted to an outbox table by triggers. The data ingestion is developed using Kafka connectors on the local Confluent platform, where Debezium for PostgreSQL is used as the source connector and the Lenses S3 sink connector is used as the sink connector. We confirmed that order creation and update events are captured as expected, and the solution is ready for production deployment. In this post, we'll build the CDC part of the solution on AWS using Amazon MSK and MSK Connect.
Architecture
As described in a Red Hat IT topics article, change data capture (CDC) is a proven data integration pattern to track when and what changes occur in data and then alert other systems and services that must respond to those changes. Change data capture helps maintain consistency and functionality across all systems that rely on data.
The primary use of CDC is to enable applications to respond almost immediately whenever data in databases changes. Specifically, its use cases cover microservices integration, data replication with up-to-date data, building time-sensitive analytics dashboards, auditing and compliance, cache invalidation, full-text search and so on. There are a number of approaches to CDC – polling, dual writes and log-based CDC. Among those, log-based CDC has advantages over the other approaches.
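To make the contrast concrete, below is a minimal polling-based CDC sketch in Python using psycopg2. The table and `updated_at` column are hypothetical and purely for illustration; the point is that rows changed and reverted, or deleted, between polls are never observed, whereas log-based CDC reads every change from the database's write-ahead log.

```python
# A simple polling-based CDC loop (illustrative only - the orders table and
# updated_at column are hypothetical). Changes that are reverted or rows that
# are deleted between polls are never observed, which is the main weakness
# that log-based CDC avoids by reading the write-ahead log.
import time

import psycopg2

conn = psycopg2.connect(
    host="localhost", dbname="northwind", user="postgres", password="postgres"
)
last_seen = "1970-01-01"

while True:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT order_id, updated_at FROM orders "
            "WHERE updated_at > %s ORDER BY updated_at",
            (last_seen,),
        )
        for order_id, updated_at in cur.fetchall():
            print(f"change detected: order {order_id} at {updated_at}")
            last_seen = updated_at
    time.sleep(5)  # poll every 5 seconds
```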
Both Amazon DMS and Debezium implement log-based CDC. While the former is a managed service, the latter can be deployed to a Kafka cluster as a (source) connector. Debezium uses Apache Kafka as a messaging service to deliver database change notifications to the applicable systems and applications. Note that Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other data systems using connectors. In AWS, we can use Amazon MSK and MSK Connect to build a Debezium-based CDC solution.
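As a preview of what this looks like, the sketch below registers a Debezium for PostgreSQL source connector with MSK Connect through the boto3 `kafkaconnect` client. All ARNs, endpoints, credentials and the `public.outbox` table name are placeholders rather than the values used in this series, and the configuration assumes a Debezium 1.x custom plugin has already been uploaded to MSK Connect.

```python
# A minimal sketch of creating a Debezium PostgreSQL source connector with
# MSK Connect via boto3. Every value in angle brackets is a placeholder.
import boto3

client = boto3.client("kafkaconnect")

# Debezium 1.x source connector properties (table name is hypothetical).
debezium_config = {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "plugin.name": "pgoutput",
    "database.hostname": "<postgres-endpoint>",
    "database.port": "5432",
    "database.user": "<db-user>",
    "database.password": "<db-password>",
    "database.dbname": "northwind",
    "database.server.name": "northwind",
    "table.include.list": "public.outbox",  # outbox table populated by triggers
}

client.create_connector(
    connectorName="northwind-source-connector",
    kafkaConnectVersion="2.7.1",
    connectorConfiguration=debezium_config,
    capacity={"provisionedCapacity": {"mcuCount": 1, "workerCount": 1}},
    kafkaCluster={
        "apacheKafkaCluster": {
            "bootstrapServers": "<msk-bootstrap-servers>",
            "vpc": {"securityGroups": ["<sg-id>"], "subnets": ["<subnet-id>"]},
        }
    },
    kafkaClusterClientAuthentication={"authenticationType": "NONE"},
    kafkaClusterEncryptionInTransit={"encryptionType": "PLAINTEXT"},
    plugins=[{"customPlugin": {"customPluginArn": "<debezium-plugin-arn>", "revision": 1}}],
    serviceExecutionRoleArn="<execution-role-arn>",
)
```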
Data replication to data lakes using CDC can be much more effective if data is stored in a format that supports atomic transactions and consistent updates. Popular choices are Apache Hudi, Apache Iceberg and Delta Lake. Among those, Apache Hudi can be a good option as it is well-integrated with AWS services.
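To illustrate why this matters for CDC output, below is a minimal PySpark sketch of a Hudi upsert to S3. The table name, key fields and bucket are hypothetical, not the ones used later in the series, and it assumes the Hudi Spark bundle is available on the classpath (e.g. supplied via `--packages` or `--jars`).

```python
# A minimal PySpark sketch of a Hudi upsert to S3 - table name, key fields and
# bucket are hypothetical. Records sharing the same record key are updated in
# place rather than appended, which is what CDC-driven replication relies on.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hudi-upsert-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

df = spark.createDataFrame(
    [(10248, "shipped", "2021-12-01 10:00:00")],
    ["order_id", "status", "updated_at"],
)

# Subsequent writes with the same order_id update the existing record.
df.write.format("hudi").options(**hudi_options).mode("append").save("s3://my-bucket/orders/")
```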
The diagram below shows the architecture of the data lake solution that we will be building in this series of posts.
In the remainder of this post, we'll build the CDC part of the solution using Amazon MSK and MSK Connect.