Apache Kafka Internals - Part 1
Kafka's internal architecture is designed to provide a robust, durable, and high-performance platform for handling real-time data feeds. Its distributed nature allows it to scale out and handle very large volumes of data while preserving the order of messages within each partition. The combination of these features makes Kafka a powerful tool for building real-time streaming data pipelines and applications.
Kafka's architecture is divided into two main layers: the storage layer and the compute layer.
What is Apache ZooKeeper?
Apache ZooKeeper is a centralized service that is crucial for distributed applications. It provides a reliable way to coordinate and manage a cluster of machines. At its core, ZooKeeper keeps track of data in a tree-like structure called znodes, which can be thought of as files and directories. Znodes can be regular or ephemeral; regular znodes persist until they are explicitly deleted, while ephemeral znodes disappear when the session that created them ends.
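To make the two znode types concrete, here is a minimal sketch using the official org.apache.zookeeper client; the connection string and znode paths are assumptions chosen for illustration.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeExample {
    public static void main(String[] args) throws Exception {
        // Connect to a local ZooKeeper server (address is an assumption for this sketch).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> {});

        // A regular (persistent) znode survives until it is explicitly deleted.
        zk.create("/app-config", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // An ephemeral znode is removed automatically when this session ends.
        zk.create("/worker-1", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        zk.close(); // closing the session deletes /worker-1, but /app-config remains
    }
}
```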
ZooKeeper ensures that the entire distributed system has a consistent view of the data. When a process changes the data in a znode, ZooKeeper replicates the change across the servers in its ensemble and commits it once a quorum has acknowledged it, ensuring that all clients see the same view of the data.
Processes can set watches on znodes. When a watched znode changes, ZooKeeper notifies all the watching processes of the change. This feature is particularly useful for coordination tasks like leader election, configuration management, and cluster management. ZooKeeper's design aims to be simple: it does not store large data values, but focuses on efficiently managing synchronization primitives such as locks and barriers.
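A minimal sketch of a watch, again with the org.apache.zookeeper client and an illustrative path; note that standard ZooKeeper watches are one-shot and must be re-registered after they fire.

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class WatchExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> {});

        // Register a one-shot watch on a znode: the callback fires once on the
        // next change (data written, node created or deleted), after which the
        // watch must be re-registered to keep observing.
        Watcher watcher = (WatchedEvent event) ->
                System.out.println("znode " + event.getPath() + " changed: " + event.getType());

        Stat stat = zk.exists("/app-config", watcher);
        System.out.println("/app-config " + (stat == null ? "does not exist yet" : "exists"));

        Thread.sleep(60_000); // keep the session alive long enough to observe a change
        zk.close();
    }
}
```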
What is Cluster Membership?
Kafka cluster membership refers to the collection of brokers that form a Kafka cluster. These brokers work together to provide a distributed, fault-tolerant, and scalable messaging platform. Each broker in the cluster has a unique identifier, which is used to manage and maintain the cluster's overall health and performance. The brokers coordinate with each other to handle multiple data streams, ensuring low latency, data durability, and scalability.
An ephemeral node in the context of Kafka and ZooKeeper is a temporary node that is automatically removed when the session that created it ends. This is often used to represent the presence of a broker in the cluster; when the broker is active, its ephemeral node exists in ZooKeeper, and if the broker disconnects or crashes, the ephemeral node is deleted.
When a broker is disconnected from ZooKeeper, the ephemeral node associated with that broker is removed. This signals to the rest of the Kafka cluster that the broker is no longer available. As a result, Kafka will trigger a rebalance of the partitions that were managed by the disconnected broker, electing new leaders for those partitions if necessary.
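In ZooKeeper-backed Kafka deployments, each live broker registers an ephemeral znode under /brokers/ids named after its broker.id, so cluster membership can be observed directly. A minimal sketch (the ZooKeeper address is an assumption):

```java
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class BrokerMembership {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> {});

        // Each live broker holds an ephemeral znode under /brokers/ids, so listing
        // the children yields the current cluster membership. Passing `true` also
        // registers a watch that fires when a broker joins or leaves.
        List<String> brokerIds = zk.getChildren("/brokers/ids", true);
        System.out.println("Live brokers: " + brokerIds);

        zk.close();
    }
}
```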
If consumers or producers were using the disconnected broker, they would be affected by the disconnection. Kafka clients are designed to handle such failures by detecting broker unavailability and reconnecting to other available brokers in the cluster. They may experience a temporary disruption but will continue to function by connecting to other brokers.
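This resilience is why clients are normally configured with several bootstrap addresses rather than just one. A sketch of such a configuration with the Java producer client; the broker addresses are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ResilientClientConfig {
    public static KafkaProducer<String, String> newProducer() {
        Properties props = new Properties();
        // Listing several brokers means the client can still bootstrap and refresh
        // cluster metadata if one of them is down (addresses are placeholders).
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
                "broker1:9092,broker2:9092,broker3:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Retries let the producer recover transparently when a partition leader moves.
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        return new KafkaProducer<>(props);
    }
}
```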
Adding a new broker to the cluster gives producers and consumers an alternative connection point and additional capacity. When a new broker is added, it is assigned a unique broker.id and configured to connect to the existing ZooKeeper ensemble. The new broker can then start hosting new partitions and replicas. To balance the cluster load, existing partitions can be reassigned to the new broker using Kafka's partition reassignment tools, as the sketch below shows. This process is dynamic and does not require a restart of the existing brokers or ZooKeeper nodes.
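For illustration, a reassignment can also be expressed programmatically with the Java AdminClient (available since Kafka 2.4); the topic name and broker ids below are assumptions.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class MovePartitionToNewBroker {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Move partition 0 of "orders" so its replicas live on brokers 1, 2 and
            // the newly added broker 4 (topic name and broker ids are assumptions).
            admin.alterPartitionReassignments(Map.of(
                    new TopicPartition("orders", 0),
                    Optional.of(new NewPartitionReassignment(Arrays.asList(1, 2, 4)))
            )).all().get();
        }
    }
}
```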
What is the Controller?
The Controller in a Kafka cluster is a broker with the responsibility of managing the states of partitions and replicas. It is a critical component that performs administrative tasks such as reassigning partitions and handling broker failures. There can only be one active controller at any time, and it is elected from among the brokers in the cluster. The controller's duties include electing new partition leaders when brokers fail, reassigning partitions across the cluster, and communicating the resulting state changes to the other brokers.
The controller relies on ZooKeeper watches to do this work. A ZooKeeper watch is a notification mechanism that lets Kafka brokers learn about changes in cluster state. For example, if a broker goes down, ZooKeeper notifies the controller through a watch, and the controller then elects new leaders for the affected partitions.
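Controller election itself is built on the same primitives: in ZooKeeper mode, brokers race to create the ephemeral /controller znode, and the single broker whose create succeeds becomes the controller. A simplified sketch follows; the payload and address are illustrative (real brokers store JSON metadata in the znode).

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ControllerElectionSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> {});
        try {
            // Every broker races to create the same ephemeral znode; only one create
            // can succeed, and that broker becomes the controller. The znode vanishes
            // with its session, freeing the role if the controller dies.
            zk.create("/controller", "broker-1".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("won the election: acting as controller");
        } catch (KeeperException.NodeExistsException e) {
            // Lost the race: watch the znode to be notified if the controller fails.
            zk.exists("/controller", event ->
                    System.out.println("controller changed: " + event.getType()));
            System.out.println("another broker is already the controller");
        }
        Thread.sleep(30_000);
        zk.close();
    }
}
```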
What is Replication?
Replication in Kafka is a fundamental feature that ensures data durability and high availability. It involves creating multiple copies of data across different brokers within the Kafka cluster. Each topic in Kafka is divided into partitions, and these partitions are replicated across the cluster's brokers based on a configurable replication factor. The replication factor determines how many copies of each partition are made.
In the context of replication, each partition has one special broker known as the leader. The leader handles all read and write requests for that partition. Other brokers that hold copies of the partition act as followers. The followers replicate the data from the leader and can take over as the new leader if the current leader fails. This mechanism ensures that even in the event of a broker failure, the data remains available and the system can continue to operate without data loss.
The leader is crucial because it maintains the integrity and consistency of the data. When a producer asks for full acknowledgement, the leader confirms a write only after all in-sync replicas (ISRs) have replicated it; the ISR set consists of the followers that have fully caught up with the leader's log of messages.
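On the producer side, this guarantee is requested with the acks setting. A small sketch, with placeholder addresses, of a producer that waits for the full ISR:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class IsrAwareProducer {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // acks=all: the leader acknowledges a write only after every replica in the
        // current ISR has it, trading a little latency for durability. Combined with
        // the broker/topic setting min.insync.replicas, this bounds how far the ISR
        // may shrink before writes are rejected.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        return new KafkaProducer<>(props);
    }
}
```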
There are two types of replicas:
1. Leader Replica: For each partition, one of the replicas is designated as the leader. The leader replica handles all read and write requests for that partition and serves as the source from which the followers replicate the latest data.
2. Follower Replica: These replicas copy the data from the leader. They do not serve client requests directly; instead, they continuously fetch from the leader so that their logs stay identical to its log.
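The leader and ISR of every partition can be inspected with the Java AdminClient; a minimal sketch, assuming a topic named orders and a placeholder broker address:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class InspectReplicas {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc =
                    admin.describeTopics(List.of("orders")).all().get().get("orders");
            // For every partition, print which broker leads it and which
            // replicas are currently in sync (the ISR).
            for (TopicPartitionInfo p : desc.partitions()) {
                System.out.printf("partition %d: leader=%s isr=%s replicas=%s%n",
                        p.partition(), p.leader(), p.isr(), p.replicas());
            }
        }
    }
}
```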
What is a Request?
Requests are structured communications between clients (like producers and consumers) and the Kafka brokers. These requests are part of Kafka's binary protocol over TCP, which defines a set of API operations that clients can perform on the Kafka cluster.
Requests in Kafka are made up of a standard format that includes a header and a body. The header typically contains information such as the API key (which identifies the type of request), API version (which specifies the version of the protocol to be used), a correlation ID (a unique identifier for the request provided by the client), and a client ID (an identifier for the client making the request).
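As a rough illustration of those header fields, here is a simplified model; this is not the actual wire format, which is a compact binary encoding.

```java
// A simplified model of the fields carried in every Kafka request header.
// (Illustrative only: the real protocol encodes these as a binary layout -
// INT16 api key, INT16 api version, INT32 correlation id, and a nullable
// string client id - not as a Java object.)
public record RequestHeader(
        short apiKey,        // identifies the operation, e.g. Produce or Fetch
        short apiVersion,    // protocol version the client wants to speak
        int correlationId,   // echoed back in the response so the client can match it
        String clientId      // logical name of the client, useful in logs and quotas
) {}
```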
The Kafka protocol supports a variety of request types, including produce requests, fetch requests, metadata requests, and offset commit requests.
Request processing involves several components that work together to handle client requests and responses. Here's a simplified explanation of the two most common request types:
Producer Request:
Produce requests are sent by producers to append messages to a topic partition. They must be directed at the broker that is currently the leader for that partition.
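A minimal sketch of a producer whose send() calls are batched into produce requests; the topic, key, value, and address are assumptions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProduceExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() batches records into produce requests addressed to the leader
            // of the record's partition; the callback fires when the broker responds.
            producer.send(new ProducerRecord<>("orders", "order-42", "created"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace();
                        } else {
                            System.out.printf("appended to %s-%d at offset %d%n",
                                    metadata.topic(), metadata.partition(), metadata.offset());
                        }
                    });
        } // close() flushes outstanding batches
    }
}
```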
Fetch Request:
Fetch requests are sent by consumers, and also by follower replicas, to read messages from a partition starting at a given offset.
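A minimal consumer sketch: each poll() issues fetch requests to the leaders of the assigned partitions. The topic, group id, and address are assumptions.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class FetchExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-readers");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            // Each poll() sends fetch requests asking for records from the
            // consumer's current offsets in its assigned partitions.
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("%s-%d@%d: %s%n",
                            record.topic(), record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```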