“Why Persist When You Can Stream?”
DM Radio show December 15, 2022 at 3 PM ET
It was nice to be back on DM Radio a year after my first appearance in 2021 (“You Can Step in the Same Stream Twice”, December 16, 2021). This time the topic was “Why Persist When You Can Stream?”, with the host Eric Kavanagh asking the questions and me and Gary Hagmueller from arcion.io as guests. As usual, Eric was a dynamic host and the discussion was nicely free-ranging, but I may not have been very coherent, as I had to get up with the sun in Canberra because the show was live in the USA. So here are the notes I had prepared earlier, if you want to find out more!
1. Given that Apache Kafka is a streaming platform, what's the story around Kafka and persistence?
Apache Kafka is a distributed publish-subscribe system for streaming data from producers to consumers via Kafka brokers and topics. It's great for low-latency, high-volume applications - i.e. streaming data. But does it use persistence internally, or make it easier to work with persistent data? Yes to both!
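To make the publish-subscribe model concrete, here's a minimal Java sketch of the producer side, assuming a local broker and a topic named "events" (both illustrative, not from any of my demos):

```java
// A minimal sketch of Kafka's publish side, assuming a broker at
// localhost:9092 and a topic named "events" (both illustrative).
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MinimalProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // The producer publishes to a topic; the brokers persist the message
        // and deliver it to every subscribed consumer group.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "key1", "hello, stream"));
        } // close() flushes any buffered records
    }
}
```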
Kafka persists messages on the broker disks for data durability (zero message loss), replication (high availability and failure tolerance), fan-out for delivery of the same message to multiple consumer groups, and replaying events - this is a Kafka superpower with many use cases. And messages can be retained for as long as you like.
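Replay is easy to demonstrate: a consumer can rewind to the beginning of its assigned partitions and re-read every retained event (and setting a topic's retention.ms to -1 retains messages indefinitely). A minimal sketch, again with an assumed local broker and illustrative topic and group names:

```java
// Replay sketch: rewind to the start of the topic and re-read everything.
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "replay-demo");             // illustrative group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));
            // First poll joins the group so partitions get assigned (a sketch;
            // a real application would use a rebalance listener).
            consumer.poll(Duration.ofMillis(500));
            consumer.seekToBeginning(consumer.assignment()); // rewind: replay all retained events
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.printf("%s -> %s%n", r.key(), r.value()));
        }
    }
}
```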
Kafka microservices (built from consumers) often use a persistent backing store (e.g. PostgreSQL, Cassandra) if they need access to stateful data (e.g. the Anomalia Machina anomaly detection blog series and the geospatial anomaly detection mini-series).
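As a rough illustration of the pattern (not the actual Anomalia Machina code), here's a consumer that persists each event to PostgreSQL with plain JDBC; the topic, table, and connection details are all assumptions:

```java
// Sketch of a Kafka microservice with a persistent backing store:
// each consumed event is written to PostgreSQL via JDBC (driver on classpath).
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class StatefulConsumer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "stateful-demo");           // illustrative group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection db = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/events", "demo", "demo")) { // assumed DB
            consumer.subscribe(List.of("events"));
            PreparedStatement insert =
                db.prepareStatement("INSERT INTO events(k, v) VALUES (?, ?)");
            while (true) {
                for (var record : consumer.poll(Duration.ofSeconds(1))) {
                    insert.setString(1, record.key());
                    insert.setString(2, record.value());
                    insert.executeUpdate(); // state is now queryable by key, time, etc.
                }
            }
        }
    }
}
```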
The Kafka Streams API is used to build powerful streaming applications that combine data from multiple topics and create new data streams. State stores are a type of lightweight database which computes state from events and allows queries from Streams applications. State is created from event streams using event sourcing - which is also relevant to Uber's Cadence (e.g. the Cluedo Kafka Streams example, and the truck-overloading example in the IoT logistics demo application).
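Here's a minimal Kafka Streams sketch of the idea, assuming an input topic "events" (illustrative): it counts events per key into a named state store, then queries that store interactively - state derived from an event stream, event sourcing in miniature.

```java
// Count events per key into a queryable state store.
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class CountingStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "count-demo");       // illustrative
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("events")                        // consume the event stream
               .groupByKey()
               .count(Materialized.as("event-counts")); // state store built from events

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Interactive query: read the state store from inside the application
        // (a real app would wait until the streams instance is RUNNING first).
        ReadOnlyKeyValueStore<String, Long> counts = streams.store(
            StoreQueryParameters.fromNameAndType("event-counts",
                QueryableStoreTypes.keyValueStore()));
        System.out.println("count for key1 = " + counts.get("key1"));
    }
}
```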
The Kafka Connect API enables easy integration of diverse source and sink systems via Kafka. These are often databases or data analytics systems (e.g. the Kafka Connect pipeline series, which consumed NOAA tidal data and sent it to either OpenSearch or PostgreSQL for visualization with Kibana/Apache Superset).
But if you want to use a database as a source system, you need to do something clever to turn stateful data into change data events. Debezium is an open source technology for this: it captures changes from the source database (typically from its transaction log) and uses Kafka Connect to send the change data stream to Kafka and on to other downstream systems (e.g. the Debezium for PostgreSQL and Cassandra blogs).
Up until very recently Kafka actually relied on an external system for metadata storage and to coordinate the Kafka brokers: another Apache project, ZooKeeper, explicitly designed for distributed systems coordination. More about that next.
2. What's new for Apache Kafka this year?
At the ApacheCon New Orleans conference in October this year I actually gave two talks on ZooKeeper which were more or less contradictory. In the first talk, I advocated ZooKeeper as a powerful and mature open source technology for distributed systems management. In the second talk, I explained why Apache Kafka was abandoning ZooKeeper, so it is no longer needed (for Kafka at least)!
So the big change for Kafka this year is akin to a brain transplant! Instead of the external ZooKeeper service, a new internal protocol called KRaft (Kafka Raft) now handles metadata management/storage and controller elections.
KRaft uses Kafka itself for metadata storage. It will be faster, more scalable, and cheaper/easier to manage, as you won't have to worry about running an external ZooKeeper service anymore.
Instaclustr will have managed Kafka with KRaft available soon, but in the meantime, I did some experiments myself.
I found that the new KRaft mode is a lot faster at creating Kafka topics with many partitions (around 1 second compared with tens of seconds). Partitions are the main concurrency mechanism in Kafka: the more partitions a topic has, the more concurrency, consumers, and therefore message throughput it can handle.
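For anyone who wants to repeat the experiment, here's a rough sketch using the Kafka Admin API; the broker address, topic name, and test-only replication factor of 1 are all assumptions:

```java
// Time how long it takes to create a topic with a large number of partitions.
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicCreationTimer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (Admin admin = Admin.create(props)) {
            long start = System.currentTimeMillis();
            // 10,000 partitions, replication factor 1 (a test setting, not production)
            admin.createTopics(List.of(new NewTopic("big-topic", 10_000, (short) 1)))
                 .all().get(); // block until the controller has finished
            System.out.printf("created in %d ms%n", System.currentTimeMillis() - start);
        }
    }
}
```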
If a server fails you need to move its partitions to another broker - that's a lot faster now too (42 seconds compared with 600 seconds to move 10,000 partitions).
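Reassignment can also be driven programmatically through the Admin API. A hedged sketch that moves a single partition (the broker id and topic name are illustrative; my experiment moved many partitions at once):

```java
// Move one partition so its replica lives on a different broker.
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class MovePartition {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (Admin admin = Admin.create(props)) {
            // Reassign partition 0 of "big-topic" to broker 2 (illustrative ids).
            admin.alterPartitionReassignments(Map.of(
                new TopicPartition("big-topic", 0),
                Optional.of(new NewPartitionReassignment(List.of(2)))
            )).all().get(); // block until the reassignment is accepted
        }
    }
}
```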
And finally, it's now possible to create topics with a very large number of partitions, potentially more than 1 million! So you can create some very big, high-throughput Kafka clusters if you need to. Many of our Kafka customers already have very big clusters and the challenge of growing workloads, so this “brain transplant” is likely to be important to them.
[On the show, I talked about this change in the context of more general economic and technical challenges requiring optimisation and efficiency gains.]
3. Have you come across any new Use Cases for Kafka this year, or any new interesting open source technologies, or both?
I also talked about the interesting drone delivery demo application I wrote, blogged, and talked about this year. It was built using a combination of a new open source workflow engine, Uber's Cadence, and Apache Kafka. I thought it was an interesting example for this show, combining streaming data and persistent data (and state). In one of my blogs I called it the “ballet” integration pattern: a way of combining orchestration (workflow) and choreography (streaming) systems into one unified real-time event-based application which is also stateful, fault-tolerant, and potentially long-running. It also enables systems running in different time frames to interact: streaming data is fast and high volume, but workflows are sequence/order based and long-running, with scheduled steps.
Cadence is a workflow engine, good for orchestrating multiple sequenced, stateful, scheduled, and long-running steps. It's highly scalable and fault tolerant. It actually uses event sourcing to achieve both of these (similar to Kafka Streams state stores), with state written to and read from a PostgreSQL or Cassandra database, plus Kafka and OpenSearch for enhanced workflow visibility and management. So again, like Kafka, it uses both persistent data and streaming data internally.
Cadence is a good match for the drone delivery problem, as there are lots of steps in the delivery process to get the drone from its base to the customer and back again, including computing drone movement from one location to the next and managing the orders. And you want the software to be fault-tolerant so drones don't get lost or just drop out of the sky; they have enough problems with bird attacks already.
I modeled both drones and order deliveries as Cadence workflows: each drone has a Drone workflow (which just repeats), but there's a new Order workflow for each order. A sketch of how the two workflow types might be declared follows.
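This is a hedged sketch (not the actual demo code) using the Cadence Java client; all names and method signatures are illustrative:

```java
// Illustrative Cadence workflow interfaces for the drone delivery demo
// (in a real project each interface would live in its own file).
import com.uber.cadence.workflow.SignalMethod;
import com.uber.cadence.workflow.WorkflowMethod;

interface DroneWorkflow {
    @WorkflowMethod
    void startDrone();                 // long-running: charge, wait, deliver, repeat

    @SignalMethod
    void deliverOrder(String orderId); // sent by the matching microservice (see below)
}

interface OrderWorkflow {
    @WorkflowMethod
    void processOrder(String orderId); // one workflow instance per order

    @SignalMethod
    void updateLocation(double lat, double lon); // the drone reports delivery progress
}
```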
Cadence is an example of a persistent orchestration system, so where does streaming fit? Well, I wanted to explore ways to integrate it with Kafka. There were a couple of things I tried out.
First, starting the delivery process is event-driven. Once a customer places an order, an event is sent to Kafka, which then sends another event to Cadence to start a new Order workflow; this notifies the donut shop about the new order. A minimal sketch of this event-driven start follows.
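This sketch reuses the OrderWorkflow interface above (assumed to be in the same package); the topic name and helper class are illustrative, and client/stub configuration details vary with the Cadence client version:

```java
// Start a new Order workflow for each new-order event read from Kafka.
import java.time.Duration;
import java.util.List;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import com.uber.cadence.client.WorkflowClient;

public class OrderStarter {
    // The consumer and client are assumed to be configured elsewhere.
    public static void run(KafkaConsumer<String, String> consumer, WorkflowClient client) {
        consumer.subscribe(List.of("new-orders")); // illustrative topic name
        while (true) {
            for (var record : consumer.poll(Duration.ofSeconds(1))) {
                // One new Order workflow per order event; start() returns without
                // blocking, leaving the long-running workflow to Cadence.
                OrderWorkflow order = client.newWorkflowStub(OrderWorkflow.class);
                WorkflowClient.start(order::processOrder, record.value());
            }
        }
    }
}
```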
That part is pretty simple, but how do we coordinate the drones and the orders? The different workflows don't know about each other, and you want exactly one drone to collect each order, as fast as possible. I introduced a new Kafka microservice which matches ready orders with ready drones. Once an order is ready, the Order workflow sends a message to Kafka, and when a drone is charged and ready to go, it also sends a message to Kafka. The microservice matches a ready order with a ready drone and sends a message to that drone using a Cadence “signal” (a Cadence event mechanism). The Drone workflow then manages the flight to the donut shop, picking up the donuts, delivering them to the customer, and returning to base, hopefully avoiding hungry birds. All the while it also sends location updates to the Order workflow, so the customer can track the delivery and get ready to receive their donuts, completing the order. The drone flies back to base, recharges, and is then ready for the next order. It's a bit tricky to explain all this without a diagram, so here's the one from the final blog:
[Diagram from the final blog: the Drone and Order workflows coordinated via the Kafka matching microservice]
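And here's a hedged sketch of the matching step itself: once the microservice has paired a ready order with a ready drone (both read from Kafka topics), it signals the running Drone workflow by id. The class name and workflow ids are illustrative:

```java
// Signal a ready drone's workflow with its matched order.
import com.uber.cadence.client.WorkflowClient;

public class DroneOrderMatcher {
    private final WorkflowClient client; // assumed to be configured elsewhere

    public DroneOrderMatcher(WorkflowClient client) {
        this.client = client;
    }

    public void match(String droneWorkflowId, String orderId) {
        // Look up the running Drone workflow by id; Cadence delivers the
        // signal as a call to the @SignalMethod on the workflow.
        DroneWorkflow drone = client.newWorkflowStub(DroneWorkflow.class, droneWorkflowId);
        drone.deliverOrder(orderId);
    }
}
```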
So this was a successful demonstration of integrating two different but complementary styles of big data architectures - Cadence workflow orchestration and Kafka streaming choreography, achieving a nice Drone "Ballet".
How well did it work? With the prototype, we managed to get 2,000 (simulated) drones flying at once. But it was realistic enough that one of my work colleagues checked to see if I had obtained a drone flying license.