“Why Persist When You Can Stream?”
DM Radio show December 15, 2022 at 3 PM ET
It was nice to be back on DM Radio a year after my first appearance in 2021 (“You Can Step in the Same Stream Twice”, December 16, 2021). This time the topic was “Why Persist When You Can Stream?”, with the host Eric Kavanagh asking the questions and me and Gary Hagmueller from arcion.io as guests. As usual, Eric was a dynamic host and the discussion was nicely free-ranging, but I may not have been very coherent, as I had to get up with the sun in Canberra because the show was live in the USA. So here are the notes I had prepared earlier, if you want to find out more!
1. Given that Apache Kafka is a streaming platform, what's the story around Kafka and persistence?
Apache Kafka is a distributed publish-subscribe system for streaming data from producers to consumers via Kafka brokers and topics. It's great for low-latency, high-volume applications - i.e. streaming data. But does it use persistence internally, or make it easier to work with persistent data? Yes to both!
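To make the publish-subscribe model concrete, here's a minimal Java sketch of the producer side, assuming a local broker and a topic named "events" (both illustrative, not from any of my demos):

```java
// A minimal sketch of Kafka's publish side, assuming a broker at
// localhost:9092 and a topic named "events" (both illustrative).
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MinimalProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // The producer publishes to a topic; the brokers persist the message
        // and deliver it to every subscribed consumer group.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "key1", "hello, stream"));
        } // close() flushes any buffered records
    }
}
```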
Kafka persists messages on the broker disks for data durability (zero message loss), replication (high availability and failure tolerance), fan-out for delivery of the same message to multiple consumer groups, and replaying events - this is a Kafka superpower with many use cases. And messages can be retained for as long as you like.
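Replay is easy to demonstrate: a consumer can rewind to the beginning of its assigned partitions and re-read every retained event (and setting a topic's retention.ms to -1 retains messages indefinitely). A minimal sketch, again with an assumed local broker and illustrative topic and group names:

```java
// Replay sketch: rewind to the start of the topic and re-read everything.
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "replay-demo");             // illustrative group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));
            // First poll joins the group so partitions get assigned (a sketch;
            // a real application would use a rebalance listener).
            consumer.poll(Duration.ofMillis(500));
            consumer.seekToBeginning(consumer.assignment()); // rewind: replay all retained events
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.printf("%s -> %s%n", r.key(), r.value()));
        }
    }
}
```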
Kafka microservices (built from consumers) often use a persistent backing store (e.g. PostgreSQL, Cassandra) if they need access to stateful data (e.g. the Anomalia Machina anomaly detection blog series and the geospatial anomaly detection mini-series).
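As a rough illustration of the pattern (not the actual Anomalia Machina code), here's a consumer that persists each event to PostgreSQL with plain JDBC; the topic, table, and connection details are all assumptions:

```java
// Sketch of a Kafka microservice with a persistent backing store:
// each consumed event is written to PostgreSQL via JDBC (driver on classpath).
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class StatefulConsumer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "stateful-demo");           // illustrative group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection db = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/events", "demo", "demo")) { // assumed DB
            consumer.subscribe(List.of("events"));
            PreparedStatement insert =
                db.prepareStatement("INSERT INTO events(k, v) VALUES (?, ?)");
            while (true) {
                for (var record : consumer.poll(Duration.ofSeconds(1))) {
                    insert.setString(1, record.key());
                    insert.setString(2, record.value());
                    insert.executeUpdate(); // state is now queryable by key, time, etc.
                }
            }
        }
    }
}
```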
The Kafka Streams API is used to build powerful streaming applications that combine data from multiple topics and create new data streams. State stores are a type of lightweight database which computes state from events and allows queries from Streams applications. State is created from event streams using event sourcing - which is also relevant to Uber's Cadence (e.g. the Cluedo Kafka Streams example, and the truck-overloading example in the IoT logistics demo application).
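Here's a minimal Kafka Streams sketch of the idea, assuming an input topic "events" (illustrative): it counts events per key into a named state store, then queries that store interactively - state derived from an event stream, event sourcing in miniature.

```java
// Count events per key into a queryable state store.
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class CountingStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "count-demo");       // illustrative
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("events")                        // consume the event stream
               .groupByKey()
               .count(Materialized.as("event-counts")); // state store built from events

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Interactive query: read the state store from inside the application
        // (a real app would wait until the streams instance is RUNNING first).
        ReadOnlyKeyValueStore<String, Long> counts = streams.store(
            StoreQueryParameters.fromNameAndType("event-counts",
                QueryableStoreTypes.keyValueStore()));
        System.out.println("count for key1 = " + counts.get("key1"));
    }
}
```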
The Kafka Connect API enables easy integration of diverse source and sink systems via Kafka. These are often databases or data analytics systems (e.g. the Kafka Connect pipeline series, which consumed NOAA tidal data and sent it to either OpenSearch or PostgreSQL for visualization with Kibana/Apache Superset).
But if you want to use a database as a source system, you need to do something clever to turn stateful data into change data events. Debezium is an open source technology for this: it captures changes from the source database (typically from its transaction log) and uses Kafka Connect to send the change data stream to Kafka and on to other downstream systems (e.g. the Debezium for PostgreSQL and Cassandra blogs).
Up until very recently Kafka actually relied on an external system for metadata storage and to coordinate the Kafka brokers: another Apache project, ZooKeeper, explicitly designed for distributed systems coordination. More about that next.
2. What's new for Apache Kafka this year?
At the ApacheCon New Orleans conference in October this year I actually gave two talks on ZooKeeper which were more or less contradictory. In the first talk, I advocated ZooKeeper as a powerful and mature open source technology for distributed systems management. In the second talk, I explained why Apache Kafka was abandoning ZooKeeper, so it is no longer needed (for Kafka at least)!
So the big change for Kafka this year is akin to a brain transplant! Instead of the external ZooKeeper service, a new internal protocol called KRaft (Kafka Raft) now handles metadata management/storage and controller elections.
KRaft uses Kafka itself for metadata storage. It will be faster, more scalable, and cheaper/easier to manage, as you won't have to worry about running an external ZooKeeper service anymore.
Instaclustr will have managed Kafka with KRaft available soon, but in the meantime, I did some experiments myself.
I found that the new KRaft mode is a lot faster at creating Kafka topics with many partitions (around 1 second compared with tens of seconds). Partitions are the main concurrency mechanism in Kafka: the more partitions a topic has, the more concurrency, consumers, and therefore message throughput it can handle.
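For anyone who wants to repeat the experiment, here's a rough sketch using the Kafka Admin API; the broker address, topic name, and test-only replication factor of 1 are all assumptions:

```java
// Time how long it takes to create a topic with a large number of partitions.
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicCreationTimer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (Admin admin = Admin.create(props)) {
            long start = System.currentTimeMillis();
            // 10,000 partitions, replication factor 1 (a test setting, not production)
            admin.createTopics(List.of(new NewTopic("big-topic", 10_000, (short) 1)))
                 .all().get(); // block until the controller has finished
            System.out.printf("created in %d ms%n", System.currentTimeMillis() - start);
        }
    }
}
```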
If a server fails you need to move its partitions to another broker - that's a lot faster now too (42 seconds compared with 600 seconds to move 10,000 partitions).
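Reassignment can also be driven programmatically through the Admin API. A hedged sketch that moves a single partition (the broker id and topic name are illustrative; my experiment moved many partitions at once):

```java
// Move one partition so its replica lives on a different broker.
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class MovePartition {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (Admin admin = Admin.create(props)) {
            // Reassign partition 0 of "big-topic" to broker 2 (illustrative ids).
            admin.alterPartitionReassignments(Map.of(
                new TopicPartition("big-topic", 0),
                Optional.of(new NewPartitionReassignment(List.of(2)))
            )).all().get(); // block until the reassignment is accepted
        }
    }
}
```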
And finally, it's now possible to create topics with a very large number of partitions, potentially more than 1 million! So you can create some very big, high-throughput Kafka clusters if you need to. Many of our Kafka customers already have very big clusters and the challenge of growing workloads, so this “brain transplant” is likely to be important to them.
[On the show, I talked about this change in the context of more general economic and technical challenges requiring optimisation and efficiency gains.]
3. Have you come across any new Use Cases for Kafka this year, or any new interesting open source technologies, or both?
I also talked about the interesting drone delivery demo application I wrote, blogged, and talked about this year. It was built using a combination of a new open source workflow engine, Uber's Cadence, and Apache Kafka. I thought it was an interesting example for this show, combining streaming data and persistent data (and state). In one of my blogs I called it the “ballet” integration pattern: a way of combining orchestration (workflow) and choreography (streaming) systems into one unified real-time event-based application which is also stateful, fault-tolerant, and potentially long-running. It also enables systems running in different time frames to interact: streaming data is fast and high volume, but workflows are sequence/order based and long-running, with scheduled steps.
Cadence is a workflow engine, good for orchestrating multiple sequenced, stateful, scheduled, and long-running steps. It's highly scalable and fault tolerant. It actually uses event sourcing to achieve both of these (similar to Kafka Streams state stores), with state written to and read from a PostgreSQL or Cassandra database, plus Kafka and OpenSearch for enhanced workflow visibility and management. So again, like Kafka, it uses both persistent data and streaming data internally.
Cadence is a good match for the drone delivery problem, as there are lots of steps in the delivery process to get the drone from its base to the customer and back again, including computing drone movement from one location to the next and managing the orders. And you want the software to be fault-tolerant so drones don't get lost or just drop out of the sky; they have enough problems with bird attacks already.
I modeled both drones and order deliveries as Cadence workflows: each drone has a Drone workflow (which just repeats), but there's a new Order workflow for each order. A sketch of how the two workflow types might be declared follows.
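This is a hedged sketch (not the actual demo code) using the Cadence Java client; all names and method signatures are illustrative:

```java
// Illustrative Cadence workflow interfaces for the drone delivery demo
// (in a real project each interface would live in its own file).
import com.uber.cadence.workflow.SignalMethod;
import com.uber.cadence.workflow.WorkflowMethod;

interface DroneWorkflow {
    @WorkflowMethod
    void startDrone();                 // long-running: charge, wait, deliver, repeat

    @SignalMethod
    void deliverOrder(String orderId); // sent by the matching microservice (see below)
}

interface OrderWorkflow {
    @WorkflowMethod
    void processOrder(String orderId); // one workflow instance per order

    @SignalMethod
    void updateLocation(double lat, double lon); // the drone reports delivery progress
}
```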
Cadence is an example of a persistent orchestration system, so where does streaming fit? Well, I wanted to explore ways to integrate it with Kafka. There were a couple of things I tried out.
First, starting the delivery process is event-driven. Once a customer places an order, an event is sent to Kafka, which then sends another event to Cadence to start a new Order workflow; this notifies the donut shop about the new order. A minimal sketch of this event-driven start follows.
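This sketch reuses the OrderWorkflow interface above (assumed to be in the same package); the topic name and helper class are illustrative, and client/stub configuration details vary with the Cadence client version:

```java
// Start a new Order workflow for each new-order event read from Kafka.
import java.time.Duration;
import java.util.List;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import com.uber.cadence.client.WorkflowClient;

public class OrderStarter {
    // The consumer and client are assumed to be configured elsewhere.
    public static void run(KafkaConsumer<String, String> consumer, WorkflowClient client) {
        consumer.subscribe(List.of("new-orders")); // illustrative topic name
        while (true) {
            for (var record : consumer.poll(Duration.ofSeconds(1))) {
                // One new Order workflow per order event; start() returns without
                // blocking, leaving the long-running workflow to Cadence.
                OrderWorkflow order = client.newWorkflowStub(OrderWorkflow.class);
                WorkflowClient.start(order::processOrder, record.value());
            }
        }
    }
}
```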
That part is pretty simple, but how do we coordinate the drones and the orders? The different workflows don't know about each other, and you want exactly one drone to collect each order, as fast as possible. I introduced a new Kafka microservice which matches ready orders with ready drones. Once an order is ready, the Order workflow sends a message to Kafka, and when a drone is charged and ready to go, it also sends a message to Kafka. The microservice matches a ready order with a ready drone and sends a message to that drone using a Cadence “signal” (a Cadence event mechanism). The Drone workflow then manages the flight to the donut shop, picking up the donuts, delivering them to the customer, and returning to base, hopefully avoiding hungry birds. All the while it also sends location updates to the Order workflow, so the customer can track the delivery and get ready to receive their donuts, completing the order. The drone flies back to base, recharges, and is then ready for the next order. It's a bit tricky to explain all this without a diagram, so here's the one from the final blog:
[Diagram from the final blog: the Drone and Order workflows coordinated via the Kafka matching microservice]
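And here's a hedged sketch of the matching step itself: once the microservice has paired a ready order with a ready drone (both read from Kafka topics), it signals the running Drone workflow by id. The class name and workflow ids are illustrative:

```java
// Signal a ready drone's workflow with its matched order.
import com.uber.cadence.client.WorkflowClient;

public class DroneOrderMatcher {
    private final WorkflowClient client; // assumed to be configured elsewhere

    public DroneOrderMatcher(WorkflowClient client) {
        this.client = client;
    }

    public void match(String droneWorkflowId, String orderId) {
        // Look up the running Drone workflow by id; Cadence delivers the
        // signal as a call to the @SignalMethod on the workflow.
        DroneWorkflow drone = client.newWorkflowStub(DroneWorkflow.class, droneWorkflowId);
        drone.deliverOrder(orderId);
    }
}
```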
So this was a successful demonstration of integrating two different but complementary styles of big data architectures - Cadence workflow orchestration and Kafka streaming choreography, achieving a nice Drone "Ballet".
How well did it work? With the prototype, we managed to get 2,000 (simulated) drones flying at once. But it was realistic enough that one of my work colleagues checked to see if I had obtained a drone flying license.