Following the previous article, we continue to explore key features of Kafka's design that help it achieve the target goals.
This section will focus on Kafka's solutions for message brokers, distribution, and replication.
Reminding the Kafka's design requirements:
- High-throughput
- Gracefully with large data backlogs
- Low-latency delivery
- Guarantees fault tolerance
Kafka producers batch messages and asynchronously send them to the target brokers without routing to reduce latency delivery. The consumers fetch batch messages with self-defined offset to control process rate when working with large data and ensure no messages are lost.
- Producing routing: Producers send messages directly to the target consumer, which is the leader of the target partition. The producers request metadata from any broker to keep updated server status and identify the leader of each partition.
- Asynchronous send: Producers batch messages in memory to send larger requests, improving throughput and efficiency.
- Pull model consuming: Consumers "fetch" requests from brokers leading the target partitions to control when and how much data to fetch. Pull mode prevents consumers from being overwhelmed and enables efficient batching consumption.
- Consumer offset tracking: The consumer specifies its offset in the log and receives logs behind that position. The consumer thus has significant control over the start points and can rewind it to re-consume data if need be.
Kafka topics are divided into partitions, each partition is distributed across different Kafka brokers (a single leader and zero or more followers). This architecture helps it achieve parallelism and load balancing, enabling high throughput and real-time processing.
Kafka replicates data across multiple brokers and automatically elects new leaders when a broker fails to ensure fault tolerance
- Partition: Kafka topics are split into partitions, which can be distributed across multiple brokers. A topic can be processed in parallel to improve throughput. Moreover, this allows horizontal scaling by adding more brokers and more partitions.
- Replication: Kafka replicates each partition across multiple brokers. This means that even if a broker fails, the data is still available from the replicas on other brokers.
- Leader-Follower Model: Each partition has one leader replica and multiple follower replicas. The leader by default handles all read and write operations for that partition. If a leader fails, one of the followers is promoted to be the new leader, ensuring that the partition remains available with minimal downtime.
- In-Sync Replicas (ISR) and Broker Liveness: An ISR is a replica that is up to date with the leader broker for a partition. Via heartbeats and log's delay, the leader checks If a follower fails, gets stuck, or falls behind to remove it from the list of in-sync replicas and add a new one.
- Election algorithm: Any replica in the ISR can become a new leader upon failure. This model requires fewer replicas than a strict quorum approach. If all the nodes replicating a partition die, Kafka waits for the first ISR or non-ISR replica (unclean leader election) to return.
- Committed Messages: Producers can choose how long to wait for acknowledgments (no wait, the leader, or all ISR replicas). Waiting for "all" increases durability but can reduce throughput or availability.
In summary, Kafka’s design thoughtfully addresses the challenges of high-throughput, low-latency data streaming and fault tolerance.
By combining producer-side batching and asynchronous sends with consumer-driven pull models, Kafka ensures that data is delivered efficiently without overwhelming consumers. The consumer group and offset help Kafka consumers know and control where to start or continue reading and processing messages after failures.
Partitioning topics and implementing a leader-follower replication strategy allows Kafka to scale seamlessly and maintain availability despite broker failures. The in-sync replica mechanism further reinforces fault tolerance by dynamically managing the replication state and promoting new leaders with minimal disruption.