Kafka Connect: Build and Run Data Pipelines - Book Review, Paul Brebner

Paul Brebner

Open Source Technology Evangelist at Instaclustr by NetApp

Published: November 22, 2024

Kafka Connect: Build and Run Data Pipelines, by Mickael Maison and Kate Stanley, O'Reilly September 2023, 400 pages.

I recently came across this book about Apache Kafka Connect and thought it worth my while to speed read it and give a quick review. Luckily it was available on my O'Reilly subscription, so it was easy to access. I'm a Technology Evangelist at NetApp Instaclustr, these days in the DevRel team.

So why should Developers be interested in Kafka Connect? After all, I wrote a blog series a few years ago focussed on building “zero-code” Kafka Connect data processing pipelines!

But Kafka Connect is an important part of the Kafka ecosystem, and you are likely to end up needing it sooner or later as part of a bigger application. You may even need to write your own custom Kafka connectors, which is definitely a non-trivial development task. Kafka Connect is designed for building robust, scalable streaming pipelines and integrations with multiple other heterogeneous systems.

The book starts with a good high-level overview of Kafka Connect including the Kafka Connect cluster, source and sink connectors, and some of the other nuts and bolts (plugins) such as converters, transformations and predicates that make up pipelines. Also covered are worker plug-ins, which enable reliable/scalable running of connectors.

The authors mention a few alternative integration frameworks to Kafka Connect, but I was puzzled that this is the only mention of Apache Camel. In my experience the Camel Kafka Connectors are the largest collection of open source Kafka connectors available – a whole chapter could easily have been dedicated to them – maybe in the next edition? Here is the latest of my Kafka Connect Camel Connector blogs!

Chapter 2 is a good in-depth introduction to Apache Kafka – possibly more than is needed for a Kafka Connect book, however – I'm not sure that Kafka Streams is really all that relevant here. But I have had a "thought experiment" around running Kafka Streams in Kafka Connect – not sure whether it could work or not!

Part II starts focussing on Kafka Connect, particularly for one of the roles the authors are aiming at, data engineers. It covers the Kafka Connect runtime, libraries, starting Kafka Connect, configuring and running connectors, using the REST API, deployment, and how to load plugins (for Instaclustr’s managed Kafka Connect service everything has to be in an uber jar file). There’s more detail on source and sink connectors, and on configuring and running tasks. There’s a good section on converters, including data formats and schemas (you may need a Schema Registry for this; we recommend the open source Karapace, see here for my Karapace blog series), and on configuring and using transformations and predicates. Unfortunately, the examples in this section use the relatively trivial file stream connectors, which also will not work on a cloud-hosted managed Kafka cluster service.
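To make the pieces above a bit more concrete, here is a minimal Python sketch (mine, not from the book) of the JSON body you would POST to a Connect worker's `/connectors` REST endpoint. It wires together a file stream source connector, converters, two chained transformations and a predicate. The connector name, file path, topic and the `demo-.*` pattern are made-up values for illustration, and I'm assuming the file stream connector is available on the worker's plugin path. The config keys and class names are the standard Apache Kafka ones as far as I know, but check them against the documentation for your Kafka version.

```python
import json


def build_connector_request(name, source_file, topic):
    """Build the JSON body for creating a connector via POST /connectors.

    All argument values are illustrative; nothing here talks to a real cluster.
    """
    config = {
        # The connector plugin to run and how many tasks it may use.
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": source_file,
        "topic": topic,
        # Converters control how keys and values are serialized into Kafka.
        "key.converter": "org.apache.kafka.connect.storage.StringConverter",
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
        "value.converter.schemas.enable": "false",
        # Transformations run in order: wrap each line in a struct,
        # then add a static field recording where the data came from.
        "transforms": "MakeMap,InsertSource",
        "transforms.MakeMap.type": "org.apache.kafka.connect.transforms.HoistField$Value",
        "transforms.MakeMap.field": "line",
        "transforms.InsertSource.type": "org.apache.kafka.connect.transforms.InsertField$Value",
        "transforms.InsertSource.static.field": "data_source",
        "transforms.InsertSource.static.value": name,
        # The predicate restricts InsertSource to records for matching topics.
        "transforms.InsertSource.predicate": "IsDemoTopic",
        "predicates": "IsDemoTopic",
        "predicates.IsDemoTopic.type": "org.apache.kafka.connect.transforms.predicates.TopicNameMatches",
        "predicates.IsDemoTopic.pattern": "demo-.*",
    }
    return {"name": name, "config": config}


if __name__ == "__main__":
    body = build_connector_request("file-source-demo", "/tmp/input.txt", "demo-lines")
    print(json.dumps(body, indent=2))
```

You would then send the printed JSON to the worker (by default the REST API listens on port 8083) with whatever HTTP client you prefer. Note that all config values are strings, which is what the REST API expects.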

There’s a chapter on “designing” or choosing a connector (which is definitely a non-trivial task in my experience – connectors come in all manner of types, qualities and open source versions, and configuration is often a challenge). Also watch out, as Apache Camel Kafka Connectors sometimes combine both source and sink connectors, and which one you get is determined by configuration. There’s a good discussion of mapping between systems (which is really what Kafka Connect is doing), formatting data, and processing semantics, which is a useful summary of how the Kafka delivery semantics apply to Kafka Connect. Handling failures is also critical – Kafka Connect only handles some types of failures automatically, so you also need to ensure exceptions are caught and handled correctly. This is a very important reminder (and something I had to learn the hard way):

“Each task can also encounter an error (and be marked FAILED) separately from the connector. By default, if a task has a problem, Kafka Connect lets it crash, marks it as FAILED, and does not attempt to restart it automatically.”
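Since failed tasks are not restarted for you, something has to watch for them. Below is a small Python sketch (my own, not from the book) that takes a status document in the shape returned by `GET /connectors/{name}/status` and works out which task restart endpoints (`POST /connectors/{name}/tasks/{id}/restart`) would need to be called. The sample status document, connector name and worker addresses are invented for illustration, and the sketch deliberately does no HTTP calls so it can be run anywhere.

```python
from urllib.parse import quote


def failed_task_restart_paths(status):
    """Return the REST paths to POST to in order to restart each FAILED task.

    `status` is a dict shaped like the response of GET /connectors/{name}/status.
    """
    name = quote(status["name"], safe="")  # connector names may need URL-encoding
    return [
        f"/connectors/{name}/tasks/{task['id']}/restart"
        for task in status.get("tasks", [])
        if task.get("state") == "FAILED"
    ]


# Hypothetical status document: the connector itself is RUNNING,
# but one of its two tasks has failed.
sample_status = {
    "name": "demo-sink",
    "connector": {"state": "RUNNING", "worker_id": "10.0.0.1:8083"},
    "tasks": [
        {"id": 0, "state": "RUNNING", "worker_id": "10.0.0.1:8083"},
        {"id": 1, "state": "FAILED", "worker_id": "10.0.0.2:8083", "trace": "..."},
    ],
}

if __name__ == "__main__":
    for path in failed_task_restart_paths(sample_status):
        print("POST", path)  # prints: POST /connectors/demo-sink/tasks/1/restart
```

The point of the example is that the connector state and the task states are reported separately, so checking only the connector state would miss the failed task here.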

Chapter 5 looks at some connectors in action – an S3 Sink Connector, a JDBC Source Connector, and the Debezium MySQL Source Connector. Unfortunately, only the last of these has an open source license. I found the object partitioner to be an interesting requirement/feature of the S3 example. If you need a simple open source JDBC Sink Connector, here’s a blog I wrote about one I made https://www.instaclustr.com/blog/kafka-postgres-connector-pipeline-series-part-6/ and here's the connector code and jar https://github.com/instaclustr/kafka-connect-jdbc-sink

The next chapter is dedicated to MirrorMaker2 (MM2), which is built on Kafka Connect, so it’s a very good use case for the Connect framework. I discovered something about MM2 that I didn’t know – as well as running in standalone or distributed mode (in common with other Kafka connectors), it has a dedicated driver mode. This apparently automates the deployment of multiple connectors. The authors also cover MM2 security, metrics and checkpointing – earlier than they cover similar topics for general Kafka connectors (see later chapters).
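As a rough illustration of what the dedicated mode is driven by, here is a Python sketch (mine, not the book's) that renders a minimal MM2 properties file: a list of cluster aliases, their bootstrap servers, and the replication flows to enable. The aliases `primary` and `backup`, the host names and the topic regex are assumptions for the example. The property names follow the Apache Kafka geo-replication documentation as I understand it, so verify them for your version.

```python
def render_mm2_properties(clusters, flows):
    """Render a minimal MirrorMaker 2 properties file as a string.

    clusters: dict mapping a cluster alias to its bootstrap servers.
    flows: list of (source_alias, target_alias, topics_regex) tuples.
    """
    lines = ["clusters = " + ", ".join(clusters)]
    for alias, servers in clusters.items():
        lines.append(f"{alias}.bootstrap.servers = {servers}")
    for source, target, topics in flows:
        # Each flow is enabled explicitly and given a topic filter.
        lines.append(f"{source}->{target}.enabled = true")
        lines.append(f"{source}->{target}.topics = {topics}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_mm2_properties(
        {"primary": "primary-kafka:9092", "backup": "backup-kafka:9092"},
        [("primary", "backup", ".*")],
    )
    print(text, end="")
```

A file like this is what you would pass to the `connect-mirror-maker.sh` script that ships with Kafka. From that one file MM2 works out which connectors to run for each enabled flow, which is presumably the "automates the deployment of multiple connectors" part.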

Part III covers running Kafka Connect in production. There are lots of useful operational details in this part, but you could just use Instaclustr’s managed Kafka Connect service to make things simpler. There is, however, a good discussion about Kafka Connect resource utilization and scaling, which is relevant even if you are using a managed service, and about debugging connectors – which from past experience you are highly likely to need!

Chapter 8 is a very useful summary of Kafka Connect configurations – of which there are always many and varied, but many depend on the exact connectors used.

They have left monitoring until Chapter 9 – you may need to consult this earlier, however, particularly for errors! You will see lots of those for sure.

Another chapter covers more “self-help” material such as running Kafka Connect on Kubernetes.

Part IV looks to be very valuable for developers – Building Custom Connectors and Plugins. Chapter 11 covers building source and sink connectors and appears to cover all the critical Kafka Connect components required to build your own. But there’s more! Chapter 12 explains how to extend Kafka Connect with connector and worker plug-ins – although it’s not entirely clear to me when/why this is useful.
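One idea that trips people up when writing a connector is the split between the connector (which decides how to divide the work) and its tasks (which do the work). In the real Java API this happens in the connector's `taskConfigs(int maxTasks)` method, which returns one config map per task. The following is only a language-neutral Python sketch of that idea, not the Java API and not code from the book: the `tables` list, the `"tables"` config key and the round-robin split are all my own assumptions for illustration.

```python
def task_configs(tables, max_tasks):
    """Split a list of tables into at most max_tasks task configs (round-robin).

    Mirrors the idea behind a connector's taskConfigs(maxTasks): never create
    more tasks than there are units of work, and give each task its own config.
    """
    num_groups = min(len(tables), max_tasks)
    groups = [[] for _ in range(num_groups)]
    for index, table in enumerate(tables):
        groups[index % num_groups].append(table)
    # "tables" is a hypothetical config key that each task would read on start.
    return [{"tables": ",".join(group)} for group in groups]


if __name__ == "__main__":
    for config in task_configs(["orders", "customers", "payments"], max_tasks=2):
        print(config)
    # prints {'tables': 'orders,payments'} then {'tables': 'customers'}
```

With three tables and `tasks.max` of 2 you get two tasks, and with `tasks.max` of 10 you would still only get three, which is why raising `tasks.max` does not always increase parallelism.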

And that’s it! I think the idea that this is really three books (as suggested in the Preface) is more or less correct – it’s a book for data engineers (people who want to select, configure and run connectors), a book for reliability engineers (techops I guess) who are in charge of deploying and running Kafka Connect clusters, and a book for developers (people who need to use connectors as part of a wider application and/or write/customise their own connectors). About 1/3 of the chapters are tailored more to each of these roles, so you may need to jump around a bit to find the most interesting bits for your interests.

This is a recent, up-to-date book that is definitely worth a read, and probably worth keeping for reference as well.

The Apache documentation also has good references for Kafka Connect and configurations.

The open source Apache Camel Kafka Connectors are worth checking out.

Some of my random blogs on Kafka Connect are:

Apache Kafka Connect Architecture Overview

The Kafka Connect pipeline series (REST source connector, Elasticsearch and PostgreSQL sink connectors)

Kafka Cassandra Connectors part 1 and part 2

MM2 theory and practice

Apache Camel Kafka Connectors here, here and here

Debezium PostgreSQL connector

Debezium Cassandra connector

Mickael Maison

Working on Kafka at Red Hat

3 months ago

Thanks for the review and feedback!
