Capture the Change You Want to See in the Database
Kaivalya Apte
The GeekNarrator Podcast | Staff Engineer | Follow me for #distributedsystems #databases #interviewing #softwareengineering
Change Data Capture
As the name suggests, Change Data Capture (CDC) is about capturing changes to your data. By capturing, I mean reacting to a change and doing something else with it. Take an e-commerce application: the most common functionality is `createOrder`, but once an order is created in the database, we want to do more things, such as sending a notification to the user (an email, a message, etc.) or publishing the data to a warehouse for analytics.
How can we build this?
Solution 1
Well, this looks simple. We can implement the logic in the application, something like:
public Order createOrder(CreateOrderRequest orderRequest) {
    Order order = dao.createOrder(orderRequest);
    userNotificationService.notifyOrder(order);
    dataWarehouseClient.post(order);
    return order;
}
As you can see, the application logic handles the post-order-creation work as well, and it looks straightforward. Congratulations, we have just implemented MCDC (Manual Change Data Capture). Just kidding, that is an acronym I came up with. But the point is, this approach has several problems:
- If we want to capture changes in one more system, we need to change our application code.
public Order createOrder(CreateOrderRequest orderRequest) {
    Order order = dao.createOrder(orderRequest);
    userNotificationService.notifyOrder(order);
    dataWarehouseClient.post(order);
    // new use case
    newUsecaseClient.post(order);
    return order;
}
This is not great, because you need to make changes, write tests, and build and deploy the entire order processing system just to publish changes to yet another downstream system.
- Another problem is that we haven't really thought about failures. What happens if createOrder is successful, but the user notification fails? Or the user notification is successful, but the data warehouse is down? Do we roll back the entire operation? Do we commit partial state and asynchronously try to complete the remaining operations? Well yeah, we can do that, but this is yet another piece of logic/code/infra/process to maintain, which can get super complicated.
- This brings down the availability of the createOrder process (in case you roll back), because now you need multiple systems to be up and running just to create a simple order.
- This isn't a general solution that other CDC use cases can reuse.
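The failure problem is easy to reproduce. Here is a minimal sketch, assuming in-memory lists as stand-ins for the orders table and the notification system; `PartialFailureDemo` and the `warehouseUp` flag are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the database and the downstream systems.
class PartialFailureDemo {
    static final List<String> ordersTable = new ArrayList<>();
    static final List<String> notifications = new ArrayList<>();

    // Mirrors the createOrder flow above: write first, then call downstream systems inline.
    static String createOrder(String orderId, boolean warehouseUp) {
        ordersTable.add(orderId);    // step 1: the order is stored
        notifications.add(orderId);  // step 2: the user is notified
        if (!warehouseUp) {          // step 3: the warehouse call fails
            throw new IllegalStateException("data warehouse is down");
        }
        return orderId;
    }

    public static void main(String[] args) {
        try {
            createOrder("order-1", false);
        } catch (IllegalStateException e) {
            // The caller sees a failure, yet the order and the notification already exist.
            System.out.println("createOrder failed: " + e.getMessage());
            System.out.println("orders=" + ordersTable + " notifications=" + notifications);
        }
    }
}
```

The caller gets an exception, but the order is already stored and the user has already been notified, which is exactly the partial state the questions above are about.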
Solution 2
Another approach could be to combine the capture part into one component. The idea is to publish an event to a pub-sub topic; a consumer reads from that topic and does all of the capturing work.
This works great, because now you depend only on the pub-sub system being available, and you have decoupled the downstream systems from the order service. When you want to publish changes to a new system, you don't need any change in the order service; you can simply update the CDCWorker.
But again this has some problems: you still need to implement this in your business logic. More importantly, you have to implement it in every place that needs CDC.
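Here is a rough sketch of Solution 2, assuming an in-memory queue in place of a real pub-sub broker. `OrderTopic`, `OrderService` and the `addSink`/`drain` methods are hypothetical names for illustration; only `CDCWorker` comes from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// In-memory stand-in for a pub-sub topic; a real system would use a broker.
class OrderTopic {
    private final BlockingQueue<String> events = new LinkedBlockingQueue<>();
    void publish(String orderId) { events.add(orderId); }
    String poll() { return events.poll(); } // returns null when the topic is empty
}

// The order service only stores the order and publishes one event.
class OrderService {
    private final OrderTopic topic;
    final List<String> ordersTable = new ArrayList<>();
    OrderService(OrderTopic topic) { this.topic = topic; }

    String createOrder(String orderId) {
        ordersTable.add(orderId);
        topic.publish(orderId);
        return orderId;
    }
}

// The worker owns the fan-out; new downstream systems are registered here.
class CDCWorker {
    private final OrderTopic topic;
    private final List<Consumer<String>> sinks = new ArrayList<>();
    CDCWorker(OrderTopic topic) { this.topic = topic; }

    void addSink(Consumer<String> sink) { sinks.add(sink); }

    // Handles the events currently in the topic and returns how many were processed.
    int drain() {
        int handled = 0;
        String orderId;
        while ((orderId = topic.poll()) != null) {
            for (Consumer<String> sink : sinks) {
                sink.accept(orderId);
            }
            handled++;
        }
        return handled;
    }
}

class PubSubDemo {
    public static void main(String[] args) {
        OrderTopic topic = new OrderTopic();
        OrderService service = new OrderService(topic);
        CDCWorker worker = new CDCWorker(topic);
        worker.addSink(id -> System.out.println("notify user about " + id));
        worker.addSink(id -> System.out.println("post " + id + " to warehouse"));
        service.createOrder("order-1");
        worker.drain();
    }
}
```

Adding a new downstream system is one more `addSink` call on the worker, while `OrderService.createOrder` stays untouched. The publish call still lives inside the business logic, though, which is the limitation described above.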
Solution 3
CDC frameworks like Debezium come to the rescue, as they let you implement CDC without even touching your business logic. Using Debezium, you can stream data change events into any third system and do whatever you want with them. But how is the change event stream captured?
To understand this, we need to understand how databases maintain a history of the changes happening to a piece of data. Typically, all transactional databases maintain an append-only log data structure that captures every change to the data. This is done mainly for two purposes:
- Transaction recovery - If things go bad, this log helps the database to recover the state.
- Replication - Using this log, state changes can be replicated to other nodes to keep a consistent view of the data.
Now, as databases already maintain this append-only log of transactions, can we use this log to achieve CDC? Well yes, CDC frameworks like Debezium do exactly that in a reliable way, while letting application developers focus only on the business logic.
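To make the idea concrete, here is a toy model of log-based capture. It is only a sketch under simplifying assumptions: the log is an in-memory list of strings, and `ChangeLog` and `LogTailer` are hypothetical names, not Debezium classes. Real databases keep this log in their own format (for example the MySQL binlog or the PostgreSQL write-ahead log).

```java
import java.util.ArrayList;
import java.util.List;

// Toy append-only log owned by the "database".
class ChangeLog {
    private final List<String> entries = new ArrayList<>();

    // Appends a change and returns its position in the log.
    int append(String change) {
        entries.add(change);
        return entries.size() - 1;
    }

    // Returns a copy of every change at or after the given offset.
    List<String> readFrom(int offset) {
        return new ArrayList<>(entries.subList(offset, entries.size()));
    }
}

// Hypothetical log reader: it remembers its offset, so each poll returns only new changes.
class LogTailer {
    private final ChangeLog log;
    private int nextOffset = 0;
    LogTailer(ChangeLog log) { this.log = log; }

    List<String> poll() {
        List<String> changes = log.readFrom(nextOffset);
        nextOffset += changes.size();
        return changes;
    }
}

class LogTailingDemo {
    public static void main(String[] args) {
        ChangeLog log = new ChangeLog();
        LogTailer tailer = new LogTailer(log);
        // The application only writes to the database; it knows nothing about the tailer.
        log.append("INSERT order-1");
        log.append("UPDATE order-1 status=PAID");
        System.out.println(tailer.poll());
    }
}
```

The point of the sketch is the direction of the dependency: the writer never calls the reader. The reader follows the log at its own pace and resumes from its last offset.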
Benefits:
- Application is decoupled from CDC use cases.
- CDC frameworks are built to be resilient, and Debezium ships connectors for many popular databases, including MySQL, PostgreSQL, MongoDB, SQL Server and Oracle.
- Provides low-latency data capture (typically in the millisecond range), so you usually don't have to worry about lag.
- Source and sink connectors make the whole process pluggable and easy to configure.
- You can configure what data (columns) you want to expose to the stream without making any change to the application.
- You can mask sensitive data.
- You can monitor connectors using JMX.
There are several other benefits of using a standard CDC framework like Debezium. Most importantly, it opens up a new world of streaming use cases without needing any changes in your application.
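The column filtering and masking benefits above can be pictured with a small sketch. With a CDC framework you would normally express this through connector configuration rather than code; the `ChangeEventFilter` class below is a hypothetical illustration of the effect on a single change event, with a row modelled as a plain map.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class ChangeEventFilter {
    // Drops excluded columns and replaces masked columns with a fixed placeholder.
    static Map<String, String> apply(Map<String, String> row,
                                     Set<String> excluded,
                                     Set<String> masked) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> column : row.entrySet()) {
            if (excluded.contains(column.getKey())) {
                continue; // never exposed to the stream
            }
            out.put(column.getKey(), masked.contains(column.getKey()) ? "****" : column.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("id", "42");
        row.put("email", "user@example.com");
        row.put("card_number", "4111111111111111");
        // Expose id as-is, mask email, and drop card_number entirely.
        System.out.println(apply(row, Set.of("card_number"), Set.of("email")));
    }
}
```

The application that wrote the row is not involved at all; the decision about what leaves the database sits with whoever configures the stream.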
To know more, watch my discussion with Gunnar Morling, who is currently working with Decodable and is a former project lead for the Debezium project.
If you like this edition, please subscribe to the newsletter and The GeekNarrator YouTube channel.
Also please give me a like on this post and share it with your network.
Keep learning! Keep rocking!
Cheers,
The GeekNarrator