MongoDB database migration and consolidation using Apache Kafka + Debezium connectors

In this article, we use Apache Kafka and the Debezium MongoDB connector to demonstrate how to merge multiple source MongoDB databases into a single target database. For this demonstration, we set up three MongoDB replica sets: two serve as sources and the third as the destination.

Because the oplog requires all members of a replica set to share a common history of changes, a MongoDB replica set cannot simultaneously apply changes from multiple primary nodes. This proof of concept therefore explores how to combine data from two separate replica sets into one.

MongoDB is a popular NoSQL document-oriented database that stores data in JSON-like format. It is known for its scalability, flexibility, and ease of use in building modern applications.

A MongoDB replica set is a group of MongoDB servers that provides high availability and fault tolerance for the database. It achieves this by maintaining multiple copies of the same data and automatically promoting a secondary node to primary if the primary fails.
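
As a minimal sketch of what that looks like operationally (the hostnames mongo-1, mongo-2 and mongo-3 are placeholders, not the clusters used later in this article), a three-member replica set can be initiated and inspected from mongosh:

# Initiate a three-member replica set from one of the members (hostnames are placeholders)
mongosh --host mongo-1:27017 --eval '
  rs.initiate({
    _id: "rs0",
    members: [
      { _id: 0, host: "mongo-1:27017" },
      { _id: 1, host: "mongo-2:27017" },
      { _id: 2, host: "mongo-3:27017" }
    ]
  })
'
# Show which member is PRIMARY and which are SECONDARY
mongosh --host mongo-1:27017 --eval '
  rs.status().members.forEach(m => print(m.name + " -> " + m.stateStr))
'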


Although the Debezium MongoDB connector does not become part of a replica set, it uses a similar replication mechanism to obtain oplog data. The main difference is that the connector does not read the oplog directly. Instead, it delegates the capture and decoding of oplog data to the MongoDB change streams feature. With change streams, the MongoDB server exposes the changes that occur in a collection as an event stream.
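
To illustrate what the connector consumes, here is a minimal sketch of opening a change stream directly from mongosh. This is not the connector's own code; the host is a placeholder, while the database and collection names (sourcedb1, sales) match the source configuration used later in this article:

# Change streams require a replica set; run this against any member
mongosh --host localhost:27017 --eval '
  const watchCursor = db.getSiblingDB("sourcedb1").sales.watch();
  while (!watchCursor.isExhausted()) {
    if (watchCursor.hasNext()) {
      printjson(watchCursor.next());   // each event describes an insert, update, delete, etc.
    }
  }
'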

In this setup, the MongoDB instances cluster-1 and cluster-2 are hosted on GCP Compute Engine and used as sources. A third replica set, cluster-3, was created as the MongoDB target.

The architecture looks like this:

[Architecture diagram: MongoDB cluster-1 and cluster-2 (sources) → Debezium source connectors → Kafka on GKE → MongoDB sink connectors → MongoDB cluster-3 (target)]

We used the Debezium connector to pull MongoDB change data into the Kafka cluster, which is hosted on GKE.

About the Debezium connector: Debezium's MongoDB connector tracks a MongoDB replica set or a MongoDB sharded cluster for document changes in databases and collections, recording those changes as events in Kafka topics. The connector automatically handles the addition or removal of shards in a sharded cluster, changes in membership of each replica set, elections within each replica set, and awaiting the resolution of communications problems.


We will not go into detail about how to set up a Kafka cluster here; we use Strimzi Kafka to set up the pipelines.

For installation steps, follow the links below; a minimal Kafka Connect example is shown after the list.

  1. https://strimzi.io/docs/operators/in-development/full/deploying.html
  2. https://dzone.com/articles/grafana-and-prometheus-setup-with-strimzi-aka-kafk
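
Before any KafkaConnector resource can be deployed, a Kafka Connect cluster containing the Debezium MongoDB and MongoDB sink connector plugins must exist. A minimal sketch of such a resource is shown below; the name kafka-connect matches the strimzi.io/cluster label used by the connectors in this article, but the bootstrap address and image are placeholders you would replace with your own:

kubectl apply -f - <<'EOF'
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: kafka-connect                               # must match the strimzi.io/cluster label on the KafkaConnector resources
  annotations:
    strimzi.io/use-connector-resources: "true"      # let KafkaConnector custom resources manage the connectors
spec:
  replicas: 1
  bootstrapServers: my-kafka-kafka-bootstrap:9092   # placeholder: your Kafka bootstrap service
  image: registry.example.com/kafka-connect-mongodb:latest   # placeholder image bundling the Debezium MongoDB and MongoDB sink connector plugins
  config:
    group.id: kafka-connect-cluster
    offset.storage.topic: connect-offsets
    config.storage.topic: connect-configs
    status.storage.topic: connect-status
EOF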

Essentially, we must deploy source and sink connectors to the Kafka Connect cluster. Each source needs at least one source connector, and if more flexibility is required, each collection can have its own source connector. In the simplest case, one MongoDB source maps to one Kafka source connector.

We created the following .yaml file for the first source connector.

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: source-mongodb-1
  annotations:
    strimzi.io/restart: "true"
  labels:
    strimzi.io/cluster: kafka-connect
spec:
  class: io.debezium.connector.mongodb.MongoDbConnector
  tasksMax: 1
  config:
    topic.creation.enable: true
    tasks.max: 1
    topic.prefix: mongodbserver1
    topic.creation.default.replication.factor: -1
    topic.creation.default.partitions: 10
    topic.creation.default.cleanup.policy: compact 
    topic.creation.default.compression.type: lz4
    mongodb.hosts: rs0/34.93.87.221:27017
    mongodb.user: deb_src_usr
    mongodb.password: abc1234
    database.include.list: sourcedb1
    schema.history.internal.kafka.bootstrap.servers: kafka:9092
    key.converter: org.apache.kafka.connect.json.JsonConverter
    key.converter.schemas.enable: true
    value.converter: org.apache.kafka.connect.json.JsonConverter
    value.converter.schemas.enable: true
    snapshot.mode: initial        
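
The mongodb.user above must already exist on the source replica set with permission to read the captured database. A minimal sketch of creating it from mongosh follows; the exact roles required depend on the Debezium version and deployment, so treat the role list as an assumption and check the Debezium MongoDB connector documentation:

# Run against the PRIMARY of the cluster-1 replica set (host taken from the config above)
mongosh --host 34.93.87.221:27017 --eval '
  db.getSiblingDB("admin").createUser({
    user: "deb_src_usr",
    pwd: "abc1234",
    roles: [
      { role: "read", db: "sourcedb1" }   // assumption: additional roles may be needed for change streams/snapshots
    ]
  })
'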

For the second MongoDB cluster, the configuration is the same apart from the connector name, topic prefix, MongoDB host, credentials, and database list.

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: source-mongodb-2
  annotations:
    strimzi.io/restart: "true"
  labels:
    strimzi.io/cluster: kafka-connect
spec:
  class: io.debezium.connector.mongodb.MongoDbConnector
  tasksMax: 1
  config:
    topic.creation.enable: true
    tasks.max: 1
    topic.prefix: mongodbserver2
    topic.creation.default.replication.factor: -1
    topic.creation.default.partitions: 10
    topic.creation.default.cleanup.policy: compact 
    topic.creation.default.compression.type: lz4
    mongodb.hosts: rs1/34.93.197.152:27017
    mongodb.user: deb_src_usr2
    mongodb.password: abc1234
    database.include.list: sourcedb2
    schema.history.internal.kafka.bootstrap.servers: kafka:9092
    key.converter: org.apache.kafka.connect.json.JsonConverter
    key.converter.schemas.enable: true
    value.converter: org.apache.kafka.connect.json.JsonConverter
    value.converter.schemas.enable: true
    snapshot.mode: initial        

We used the following command to deploy both the source and sink connectors:

kubectl apply -f "/path/file.yaml"        
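
After applying a resource, the connector's state can be checked through the KafkaConnector custom resource itself:

# List all connectors managed by Strimzi and their readiness
kubectl get kafkaconnectors
# Show the detailed status (connector/task state, last errors) for one connector
kubectl describe kafkaconnector source-mongodb-1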

After the source connectors were set up, Kafka Connect automatically created the topics that hold the change events; we verified this as shown below before moving on to the second step, setting up the sink connectors.
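
Here is a quick way to confirm the topics from inside a broker pod. The pod name assumes a Strimzi Kafka cluster named my-kafka and a plain listener on port 9092; adjust both to your environment:

# Debezium topics follow the <topic.prefix>.<database>.<collection> naming pattern
kubectl exec -it my-kafka-kafka-0 -- \
  bin/kafka-topics.sh --bootstrap-server localhost:9092 --list | grep mongodbserver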

Basically, each collection needs its own sink connector, so if we have n MongoDB collections, we need n sink connectors.

Sink connector 1 configuration (consuming the topic that comes from the cluster-1 MongoDB source).

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: sink-mongodb-1
  annotations:
    strimzi.io/restart: "true"
  labels:
    strimzi.io/cluster: kafka-connect
spec:
  class: com.mongodb.kafka.connect.MongoSinkConnector
  tasksMax: 2
  config:
    tasks.max: 1
    connection.uri: mongodb://deb_tgt_usr:abc1234@34.100.240.170:27017/
    topics: mongodbserver1.sourcedb1.sales
    database: sinkdb1
    collection: sales
    key.converter: org.apache.kafka.connect.json.JsonConverter
    key.converter.schemas.enable: true    
    value.converter: org.apache.kafka.connect.json.JsonConverter
    value.converter.schemas.enable: true        

Sink connector 2 configuration (consuming the topic that comes from the cluster-2 MongoDB source).

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: sink-mongodb-2
  annotations:
    strimzi.io/restart: "true"
  labels:
    strimzi.io/cluster: kafka-connect
spec:
  class: com.mongodb.kafka.connect.MongoSinkConnector
  tasksMax: 2
  config:
    tasks.max: 1
    connection.uri: mongodb://deb_tgt_usr:abc1234@34.100.240.170:27017/
    topics: mongodbserver2.sourcedb2.training
    database: sinkdb1
    collection: training
    key.converter: org.apache.kafka.connect.json.JsonConverter
    key.converter.schemas.enable: true    
    value.converter: org.apache.kafka.connect.json.JsonConverter
    value.converter.schemas.enable: true

Both sink connectors point to the same target instance; the differences are the topic they consume and the target collection they write to.

After deploying the sink connectors, the pipeline started flowing data from the sources to the target.
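
To confirm the consolidation, we can query the target replica set directly. The host, credentials, database and collection names below follow the sink connector configuration above; depending on where the target user was created, an --authenticationDatabase flag may also be needed:

# Both source collections should now exist in the single target database
mongosh --host 34.100.240.170:27017 -u deb_tgt_usr -p abc1234 --eval '
  const target = db.getSiblingDB("sinkdb1");
  print("sales documents:    " + target.sales.countDocuments());
  print("training documents: " + target.training.countDocuments());
'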

In a nutshell, CDC allowed two different MongoDB sources to feed a single MongoDB target, something a replica set cannot do natively because it replicates from only one primary. This is the rationale behind the PoC.

By using the mechanisms described above, we successfully consolidated several MongoDB databases into one.
