Internal Architecture of Kafka

Apache Kafka is a powerful event streaming platform that enables developers to process and respond to data events in real time.

Its architecture is split into two layers, the storage layer and the compute layer, both designed for large-scale distributed systems. The storage layer is optimized for efficient data retention and scales horizontally, making it easy to expand storage capacity as needed.

The compute layer, on the other hand, handles data processing and interaction. It comprises four key components: the Producer, Consumer, Streams, and Connect APIs. Together, these components let Kafka scale distributed applications while managing real-time data streams effectively.

Architecture Diagram


Overview of Kafka Architecture

This Kafka architecture diagram outlines the separation between the Compute Layer and the Storage Layer of the Kafka system, highlighting the interaction between the core components used for data streaming and processing.

1. Compute Layer

This layer is responsible for creating, processing, and consuming data streams in real-time, interacting with the Kafka cluster and the storage layer.

  • Kafka Streams API: A Java library for processing data streams in real time. Built on top of the Producer and Consumer APIs, it lets developers transform, filter, and aggregate events as they are generated, making it possible to build complex stream-processing applications that handle high volumes of data efficiently (a minimal sketch follows this list).
  • Kafka Connect API: Also built on top of the Producer and Consumer APIs, it moves data between Kafka and external systems. Source connectors pull data from external sources and publish it to Kafka topics, while sink connectors take data from Kafka topics and push it into external systems, enabling smooth interoperability.
  • Consumer API: Responsible for reading events from Kafka.
  • Producer API: Handles writing events to Kafka.
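
To make the layering concrete, here is a minimal Kafka Streams sketch. It assumes a broker reachable at localhost:9092 and hypothetical topics named orders and filtered-orders; it reads events from one topic, filters them, and writes the result to another in real time.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class OrderFilterApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-filter-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");             // hypothetical source topic
        orders.filter((key, value) -> value != null && value.length() > 10)    // illustrative predicate
              .to("filtered-orders");                                          // hypothetical sink topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Because the topology is built once and then started, the filtering runs continuously as new events arrive on the source topic.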

2. Storage Layer

The storage layer in Kafka ensures data durability, fault tolerance, and scalability. This is where Kafka's distributed system comes into play.

  • Kafka Cluster: The central part of the storage layer, consisting of multiple brokers. These brokers are responsible for receiving data streams, storing them, and distributing data across the cluster for fault tolerance and high availability (a small inspection sketch follows this list).
  • ZooKeeper: ZooKeeper manages the Kafka cluster's metadata, such as configurations, leader elections, and broker membership. It acts as a coordination service that keeps communication and fault tolerance working smoothly across the brokers.
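
As a quick illustration of working with the cluster, the AdminClient can list the brokers that make it up; a minimal sketch, assuming a broker reachable at localhost:9092:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.DescribeClusterResult;
import org.apache.kafka.common.Node;

import java.util.Properties;

public class ClusterInfo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

        try (Admin admin = Admin.create(props)) {
            DescribeClusterResult cluster = admin.describeCluster();
            System.out.println("Cluster ID: " + cluster.clusterId().get());
            // Print every broker currently registered in the cluster.
            for (Node broker : cluster.nodes().get()) {
                System.out.println("Broker " + broker.id() + " at " + broker.host() + ":" + broker.port());
            }
        }
    }
}
```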


Internal Architecture of Kafka Broker

Internal Architecture of a Kafka Broker

This diagram represents the internal architecture of a Kafka Broker and outlines how client requests are processed within the system.

When the Kafka Client sends a request to the broker, the request is first received by the Socket Receive Buffer. This is a memory buffer that temporarily holds the incoming data from clients before it is processed.

A Network Thread then picks up the request and processes it: it reads the data from the buffer and determines whether it is a produce request (writing data) or a fetch request (retrieving data). Requests fall into two types (a client-side sketch follows this list):

- Produce Requests: These are requests to write a batch of data to a specific Kafka topic.

- Fetch Requests: These are requests to read data from a Kafka topic.
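
From the client's perspective, these two request types map to the Producer API's send() call and the Consumer API's poll() loop. Below is a minimal sketch, assuming a broker at localhost:9092 and a hypothetical topic named events.

```java
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ProduceFetchExample {
    public static void main(String[] args) {
        // Produce request: write a record to the "events" topic.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("events", "key-1", "hello kafka"));
        }

        // Fetch request: read records back from the same topic.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "example-group");           // hypothetical consumer group
        consumerProps.put("auto.offset.reset", "earliest");       // start from the beginning if no committed offset exists
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("events"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.println(r.offset() + ": " + r.value()));
        }
    }
}
```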

The Request Queue holds the incoming requests handed off by the network threads. These requests are picked up by I/O threads for further processing. The queue ensures that requests are processed in the order they were received, preserving fairness and data integrity.

The I/O Threads are responsible for reading data from and writing data to disk. After picking up a produce request from the request queue, an I/O thread performs two key functions (a rough sketch of the validation step follows the list below):

  • Validation: The data is verified, including checking the CRC (Cyclic Redundancy Check) for data corruption.
  • Storing Data: The thread writes the data into the Kafka partition’s Commit Log (physical storage) and ensures that the records are stored in the proper log segments.
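
As a rough illustration of the validation step, the following sketch (conceptual only, not Kafka's internal code) recomputes a CRC32C checksum over a batch payload and compares it with the checksum that arrived with the batch; the payload bytes and the client-supplied checksum here are stand-ins.

```java
import java.util.zip.CRC32C;

public class BatchValidation {
    /**
     * Conceptual check: recompute the checksum over the batch payload and
     * compare it with the checksum that was sent alongside the data.
     */
    static boolean isBatchIntact(byte[] payload, long expectedCrc) {
        CRC32C crc = new CRC32C();
        crc.update(payload, 0, payload.length);
        return crc.getValue() == expectedCrc;
    }

    public static void main(String[] args) {
        byte[] payload = "example record batch bytes".getBytes();

        // In reality the checksum arrives with the produce request; here we fabricate one.
        CRC32C crc = new CRC32C();
        crc.update(payload, 0, payload.length);
        long checksumFromClient = crc.getValue();

        System.out.println("Batch intact? " + isBatchIntact(payload, checksumFromClient));
    }
}
```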

Kafka relies on the operating system's in-memory page cache to buffer data before it is written to disk. This reduces the number of disk I/O operations and improves performance by keeping frequently accessed data in memory.

Once data is written into the page cache, it is eventually flushed to disk for persistent storage. Kafka organizes its on-disk storage using commit logs and segments:

  • Log files (.log) store the actual records.
  • Index files (.index) map record offsets to specific positions within the log (a conceptual lookup sketch follows this list).
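
To illustrate what the index makes possible, here is a simplified, hypothetical sketch of the idea: binary-search a sorted list of (offset, file position) entries to find where in the .log file a read should start. This is a conceptual model only, not Kafka's actual index implementation.

```java
import java.util.List;

public class OffsetIndexSketch {
    // A single index entry: a record offset and its byte position in the .log file.
    record IndexEntry(long offset, long filePosition) {}

    /**
     * Find the file position of the largest indexed offset <= targetOffset,
     * so a read can start there and scan forward to the exact record.
     */
    static long lookup(List<IndexEntry> entries, long targetOffset) {
        int lo = 0, hi = entries.size() - 1;
        long position = 0;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            if (entries.get(mid).offset() <= targetOffset) {
                position = entries.get(mid).filePosition();
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return position;
    }

    public static void main(String[] args) {
        List<IndexEntry> index = List.of(
                new IndexEntry(0, 0), new IndexEntry(100, 4096), new IndexEntry(200, 8192));
        System.out.println("Start reading at byte " + lookup(index, 150)); // prints 4096
    }
}
```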

Tiered Fetch Threads handle reading data from different layers of storage (the in-memory page cache, on-disk storage, or even cloud-based object stores) and serve it back to the client. These threads respond to fetch requests efficiently by managing data across the different storage tiers.

Kafka employs a purgatory structure to handle requests that cannot be completed immediately. This typically occurs with produce requests waiting for replication across brokers or fetch requests waiting for enough data to become available (a simplified sketch follows the list below).

  • Requests remain in purgatory until their conditions are met.
  • Once conditions are satisfied (e.g., sufficient replication or available data), the request exits purgatory and proceeds to the next step.
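
Below is a heavily simplified, hypothetical sketch of the purgatory idea: a delayed operation carries a completion condition, is re-checked whenever relevant state changes, and either completes or expires after a timeout. Kafka's real purgatory is considerably more sophisticated.

```java
import java.util.function.BooleanSupplier;

public class PurgatorySketch {
    /** A pending request that waits in purgatory until its condition holds or it times out. */
    static class DelayedOperation {
        private final BooleanSupplier condition;   // e.g. "enough replicas have acknowledged"
        private final long deadlineMillis;
        private final Runnable onComplete;         // e.g. "send acknowledgement to the client"

        DelayedOperation(BooleanSupplier condition, long timeoutMillis, Runnable onComplete) {
            this.condition = condition;
            this.deadlineMillis = System.currentTimeMillis() + timeoutMillis;
            this.onComplete = onComplete;
        }

        /** Called whenever relevant state changes (e.g. a follower fetched new data). */
        boolean tryComplete() {
            if (condition.getAsBoolean()) {
                onComplete.run();
                return true;
            }
            return false;
        }

        boolean isExpired() {
            return System.currentTimeMillis() >= deadlineMillis;
        }
    }

    public static void main(String[] args) {
        final int[] acks = {1};                    // simulated replica acknowledgements
        DelayedOperation produce = new DelayedOperation(
                () -> acks[0] >= 2, 30_000, () -> System.out.println("ack sent to client"));

        System.out.println("complete yet? " + produce.tryComplete()); // false: only 1 ack so far
        acks[0] = 2;                               // a follower catches up
        System.out.println("complete yet? " + produce.tryComplete()); // true: condition met
    }
}
```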


Kafka ensures fault tolerance and data consistency using replication. The broker coordinating the produce request must ensure that data is replicated to other Kafka brokers. Until replication is completed across the necessary brokers, the produce request remains in purgatory. Once completed, the broker acknowledges the client.
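
On the client side, this guarantee surfaces through the producer's acks setting; a minimal sketch, assuming a broker at localhost:9092 and a hypothetical topic named orders, where acks=all makes the leader wait for the in-sync replicas before acknowledging:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class DurableProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait until the in-sync replicas have the record

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            RecordMetadata meta = producer
                    .send(new ProducerRecord<>("orders", "order-1", "created")) // hypothetical topic
                    .get();                                                     // blocks until the broker acknowledges
            System.out.println("Written to partition " + meta.partition() + " at offset " + meta.offset());
        }
    }
}
```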

After processing a client request, a response (such as acknowledgement of a write or delivering fetched records) is added to the Response Queue. Each network thread maintains its own response queue to manage client responses efficiently.

Kafka can integrate with external Object Stores (such as AWS S3) for long-term storage of large data sets. The Tiered Fetch Threads handle requests that need data from these external stores, providing an efficient mechanism to scale Kafka storage beyond the capacity of the local disks.

Kafka Data Replication and Leader-Follower Dynamics

By distributing data across multiple brokers, Kafka maintains fault tolerance and high availability, even in the case of broker failures.

Replication is configured at the topic level: when creating a topic, users define the number of replicas for each partition (the replication factor).

A replication factor of "N" allows Kafka to withstand up to "N-1" broker failures without losing data or sacrificing availability.
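
For example, a topic created with a replication factor of 3 can tolerate up to 2 broker failures. A minimal AdminClient sketch, assuming a broker reachable at localhost:9092 and a hypothetical topic named orders:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Properties;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

        try (Admin admin = Admin.create(props)) {
            // 3 partitions, replication factor 3: each partition is stored on 3 brokers.
            NewTopic orders = new NewTopic("orders", 3, (short) 3);
            admin.createTopics(List.of(orders)).all().get();
            System.out.println("Topic created with replication factor 3");
        }
    }
}
```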


Image source - confluent.io

Once replicas are established for each partition of a topic, one of them is designated the leader replica, and the broker that holds it is responsible for handling reads and writes to the partition. The remaining replicas are referred to as followers, which replicate data from the leader to stay in sync.

The In-Sync Replica (ISR) set includes the leader and all followers that are fully caught up with the leader’s data. Ideally, all replicas remain part of the ISR to ensure data consistency and availability.
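
The current leader and ISR of each partition can be inspected through the AdminClient; a minimal sketch, assuming a broker at localhost:9092 and an existing topic named orders (allTopicNames() is the accessor in recent client versions):

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.List;
import java.util.Properties;

public class LeaderIsrInspector {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

        try (Admin admin = Admin.create(props)) {
            TopicDescription desc = admin.describeTopics(List.of("orders"))
                                         .allTopicNames().get()
                                         .get("orders");
            // Print the leader broker and the in-sync replica set for each partition.
            for (TopicPartitionInfo p : desc.partitions()) {
                System.out.println("Partition " + p.partition()
                        + " leader=" + p.leader().id()
                        + " isr=" + p.isr());
            }
        }
    }
}
```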

Kafka Consumer Groups will be covered in the next article.


Special thanks to Jun Rao for explaining these concepts so well in the confluent.io architecture documentation.

To receive new posts and support my work, subscribe to the newsletter.


Resources

https://docs.confluent.io/kafka/introduction.html

https://kafka.apache.org/documentation/





