Kafka Connect: Build and Run Data Pipelines - Book Review, Paul Brebner

Kafka Connect: Build and Run Data Pipelines, by Mickael Maison and Kate Stanley, O'Reilly September 2023, 400 pages.

I recently came across this book about Apache Kafka Connect and thought it worth my while to speed read it and give it a quick review. Luckily it was available on my O'Reilly subscription, so it was easy to access. I'm a Technology Evangelist at NetApp Instaclustr, these days in the DevRel team.

So why should Developers be interested in Kafka Connect? After all, I wrote a blog series a few years ago focussed on building “zero-code” Kafka Connect data processing pipelines!

But Kafka Connect is an important part of the Kafka ecosystem, and you are likely to end up needing it sooner or later as part of a bigger application development. And you may even need to write your own custom Kafka connectors, which is definitely a non-trivial development task. Kafka Connect is designed for building robust, scalable streaming pipelines and integrations with multiple other heterogeneous systems.

The book starts with a good high-level overview of Kafka Connect including the Kafka Connect cluster, source and sink connectors, and some of the other nuts and bolts (plugins) such as converters, transformations and predicates making up pipelines. Also covered are worker plugins which enable reliable/scalable running of connectors.

The authors mention a few alternative integration frameworks to Kafka Connect, but I was puzzled that this is the only mention of Apache Camel, as in my experience the Camel Kafka Connectors are the largest collection of open source Kafka connectors available – a whole chapter could easily have been dedicated to the Camel Kafka Connectors – maybe in the next edition? Here is the latest of my Kafka Connect Camel Connector blogs!

Chapter 2 is a good in-depth introduction to Apache Kafka – possibly more than is needed for a Kafka Connect book, however – I'm not sure that Kafka Streams is really all that relevant here. But I have had a "thought experiment" around running Kafka Streams in Kafka Connect - not sure whether it could work or not!

Part II starts focussing on Kafka Connect, particularly for one of the roles they are aiming at, data engineers. It covers the Kafka Connect runtime, libraries, starting Kafka Connect, configuring it, running connectors, using the REST API, deployment, and how to load plugins (for Instaclustr’s managed Kafka Connect service everything has to be in an uber jar file). There’s more detail on source and sink connectors, and on configuring and running tasks. There’s a good section on converters, including data formats and schemas (you may need a Schema Registry for this, we recommend the open source Karapace, see here for my Karapace blog series) and how to configure and use them, as well as transformations and predicates. Unfortunately, the examples in this section use the relatively trivial file stream connectors, which will also not work on a cloud-hosted managed Kafka cluster service.
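To make the moving parts a bit more concrete, here is a minimal sketch (mine, not taken from the book) of what a connector configuration with a converter, a transformation and a predicate might look like, and how it could be submitted through the Connect REST API with `PUT /connectors/{name}/config`. The worker address, connector name, file path, topic names and the `route`/`isDemoTopic` aliases are all assumptions for illustration. It uses the same file stream source connector as the book, so the caveat above still applies: it is only suitable for a local test cluster.

```python
import json
import urllib.request

# Assumed address of a local Kafka Connect worker (8083 is the default REST port).
CONNECT_URL = "http://localhost:8083"


def file_source_config(path: str, topic: str) -> dict:
    """Build an illustrative config for the FileStreamSource example connector."""
    return {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": path,
        "topic": topic,
        # Converter: controls how record values are serialized into Kafka.
        "value.converter": "org.apache.kafka.connect.storage.StringConverter",
        # Transformation: rewrite the destination topic name...
        "transforms": "route",
        "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
        "transforms.route.regex": "(.*)",
        "transforms.route.replacement": "$1-copy",
        # ...but only for records that match this predicate.
        "transforms.route.predicate": "isDemoTopic",
        "predicates": "isDemoTopic",
        "predicates.isDemoTopic.type": "org.apache.kafka.connect.transforms.predicates.TopicNameMatches",
        "predicates.isDemoTopic.pattern": "demo-.*",
    }


def put_config_request(name: str, config: dict) -> urllib.request.Request:
    """Prepare (but do not send) the REST call that creates or updates a connector."""
    return urllib.request.Request(
        url=f"{CONNECT_URL}/connectors/{name}/config",
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical connector name, file path and topic.
req = put_config_request(
    "file-source-demo", file_source_config("/tmp/demo.txt", "demo-lines")
)
# To actually submit it against a running worker: urllib.request.urlopen(req)
```

The request is only built here, not sent, so the sketch can be read and run without a Connect cluster; sending it is one `urlopen` call once a worker is available.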

There’s a chapter on “designing” or choosing a connector (which is definitely a non-trivial task in my experience – connectors come in all manner of types, qualities and open source versions, and configuration is often a challenge). Also watch out, as Apache Camel Kafka Connectors sometimes combine both source and sink connectors, and which one you get is determined by configuration. There’s a good discussion of mapping between systems (which is really what Kafka Connect is doing), formatting data, and processing semantics, which is a useful summary of how the Kafka delivery semantics apply to Kafka Connect. Handling failures is also critical – Kafka Connect only handles some types of failures automatically, so you also need to ensure exceptions are caught and handled correctly. This is a very important reminder (and something I had to learn the hard way):

“Each task can also encounter an error (and be marked FAILED) separately from the connector. By default, if a task has a problem, Kafka Connect lets it crash, marks it as FAILED, and does not attempt to restart it automatically.”
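Since failed tasks stay failed until someone restarts them, a small watchdog is one way to cope. The following is a rough sketch (again mine, not the book's) of the core logic: given a status document shaped like the response of `GET /connectors/{name}/status`, it works out which `POST /connectors/{name}/tasks/{id}/restart` calls would be needed. The connector name, worker addresses and sample payload are made up for illustration.

```python
def failed_task_restart_paths(status: dict) -> list:
    """Return the REST paths to POST to in order to restart every FAILED task."""
    name = status["name"]
    return [
        f"/connectors/{name}/tasks/{task['id']}/restart"
        for task in status.get("tasks", [])
        if task.get("state") == "FAILED"
    ]


# Hypothetical payload, shaped like the output of GET /connectors/{name}/status.
sample_status = {
    "name": "jdbc-sink-demo",
    "connector": {"state": "RUNNING", "worker_id": "10.0.0.1:8083"},
    "tasks": [
        {"id": 0, "state": "RUNNING", "worker_id": "10.0.0.1:8083"},
        {"id": 1, "state": "FAILED", "worker_id": "10.0.0.2:8083", "trace": "..."},
    ],
}

print(failed_task_restart_paths(sample_status))
# prints ['/connectors/jdbc-sink-demo/tasks/1/restart']
```

Note that the connector itself is still RUNNING in this sample even though one of its tasks has failed, which is exactly the situation the quote warns about. Blindly restarting in a loop is rarely a good idea, though; it is worth looking at the `trace` field first to see why the task failed.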

Chapter 5 looks at some connectors in action – an S3 Sink Connector, a JDBC Source Connector, and the Debezium MySQL Source Connector. Unfortunately, only the last of these has an open source license. I found the object partitioner to be an interesting requirement/feature of the S3 example. If you need a simple open source JDBC Sink Connector, here’s a blog I wrote about one I made: https://www.instaclustr.com/blog/kafka-postgres-connector-pipeline-series-part-6/ and here's the connector code and jar: https://github.com/instaclustr/kafka-connect-jdbc-sink

The next chapter is dedicated to MirrorMaker2 (MM2), which is built on Kafka Connect, so it’s a very good use case for the Connect framework. I discovered something about MM2 that I didn’t know – as well as running it in standalone or distributed modes (in common with other Kafka connectors), it has a dedicated driver mode. This apparently automates the deployment of multiple connectors. They also cover MM2 security, metrics and checkpointing – earlier than they cover similar topics for general Kafka connectors (see later chapters).

Part III covers running Kafka Connect in production. There are lots of useful operational details in this part, but you could just use Instaclustr’s managed Kafka Connect service to make things simpler. There is, however, a good discussion about Kafka Connect resource utilization and scaling, which is relevant even if you are using a managed service, and about debugging connectors – which from past experience you are highly likely to need!

Chapter 8 is a very useful summary of Kafka Connect configurations – of which there are always many and varied, but many depend on the exact connectors used.

They have left monitoring until Chapter 9 – possibly you will need to consult this earlier, however, particularly for errors! You will see lots of those for sure.

Another chapter covers more “self-help” material such as running Kafka Connect on Kubernetes.

Part IV looks to be very valuable for developers – Building Custom Connectors and Plugins. Chapter 11 covers building source and sink connectors and appears to cover all the critical Kafka Connect components required to build your own. But there’s more! Chapter 12 explains how to extend Kafka Connect with connector and worker plugins – although it’s not entirely clear to me when/why this is useful.

And that’s it! I think the idea that this is really three books (as suggested in the Preface) is more or less correct – it’s a book for data engineers (people who want to select, configure and run connectors), a book for reliability engineers (techops I guess) who are in charge of deploying and running Kafka Connect clusters, and a book for developers (people who need to use connectors as part of a wider application and/or write/customise their own connectors). About 1/3 of the chapters are tailored more to each of these roles, so you may need to jump around a bit to find the most interesting bits for your interests.

This is a recent, up-to-date book that is definitely worth a read, and probably worth keeping for reference as well.


The Apache documentation also has good references for Kafka Connect and configurations.

The open source Apache Camel Kafka Connectors are worth checking out.


Some of my random blogs on Kafka Connect are:

Apache Kafka Connect Architecture Overview

The Kafka Connect pipeline series (REST source connector, Elasticsearch and PostgreSQL sink connectors)

Kafka Cassandra Connectors part 1 and part 2

MM2 theory and practice

Apache Camel Kafka Connectors here, here and here

Debezium PostgreSQL connector

Debezium Cassandra connector

Mickael Maison

Working on Kafka at Red Hat

3 months ago

Thanks for the review and feedback!
