Demystifying Kafka

In today's internet landscape, log data is a critical asset for any sizable company. Data generated from user activity events, operational metrics, and system telemetry provides valuable insight into user engagement, system utilization, and much more.

Traditionally, log data has been used for analytics purposes to track user engagement and system utilization. However, recent trends in Internet applications have made activity data a part of the production data pipeline. This means that log data is being used directly in site features, providing a more personalized and interactive user experience.

One of the key uses of log data is search relevance, where activity data is used to improve the accuracy of search results. Recommendations are another important use of log data, where user activity is analyzed to identify popular items or items that are commonly viewed together, and these insights are used to provide personalized recommendations to users.

The practice of capturing data in real time from event sources such as databases, sensors, mobile devices, cloud services, and software applications, in the form of streams of events (for example, the logs generated by a web service or application), is known as event streaming.

Event Streaming

Event streaming is a powerful mechanism that enables companies to manage and process massive amounts of data in real-time. By capturing and processing event streams, companies can store data for later retrieval, manipulate and react to data in real time, and route data to different destinations as needed.

Event streaming is applied to a wide variety of use cases, which include:

  • To process payments and financial transactions in real-time, such as in stock exchanges, banks, and insurance.
  • To track and monitor cars, trucks, fleets, and shipments in real-time, such as in logistics and the automotive industry.
  • To continuously capture and analyze sensor data from IoT devices or other equipment, such as in factories and wind farms.
  • To collect and immediately react to customer interactions and orders, such as in retail, the hotel and travel industry, and mobile applications.
  • To monitor patients in hospital care and predict changes in condition to ensure timely treatment in emergencies.
  • To connect, store, and make available data produced by different divisions of a company.
  • To serve as the foundation for data platforms, event-driven architectures, and microservices.

What is Kafka?

To enable event streaming, we need a platform that lets us publish (write) and subscribe to (read) streams of events, including continuously importing and exporting data from other systems; that stores streams of events durably and reliably for as long as we want; and that processes streams of events as they occur or retrospectively. Kafka is one such platform, used by many companies.

Apache Kafka is a distributed streaming platform that was initially developed by engineers at LinkedIn and later open-sourced in 2011. The project is now maintained by the Apache Software Foundation and has become one of the most widely used technologies for handling real-time data streams.

Other messaging systems exist for processing asynchronous data flows, but they are a poor fit for log processing: their feature sets do not match the problem, they do not treat throughput as the primary design constraint, their distributed support is weak, and their performance degrades when messages are allowed to accumulate.

Specialized log aggregators such as Facebook's Scribe, Yahoo's Data Highway, and Cloudera's Flume have been developed to address the limitations of traditional enterprise messaging systems. However, most of these systems are built for consuming log data offline and often expose unnecessary implementation details to the consumer. Additionally, most of them use a "push" model, which may not be suitable for applications that require a "pull" model.

Considering these existing platforms and their limitations, Kafka was designed differently to overcome them.

Architecture

Broadly there are two major parts of the Kafka ecosystem, servers and clients:

Servers: Kafka is run as a cluster of one or more servers that can span multiple data centers or cloud regions. Some of these servers form the storage layer, called the brokers. Other servers run Kafka Connect to continuously import and export data as event streams to integrate Kafka with your existing systems such as relational databases as well as other Kafka clusters. To let you implement mission-critical use cases, a Kafka cluster is highly scalable and fault-tolerant: if any of its servers fails, the other servers will take over their work to ensure continuous operations without any data loss.

Clients(Producers and consumers): They allow you to write distributed applications and microservices that read, write, and process streams of events in parallel, at scale, and in a fault-tolerant manner even in the case of network problems or machine failures. Kafka ships with some such clients included, which are augmented by dozens of clients provided by the Kafka community: clients are available for Java and Scala including the higher-level Kafka Streams library, for Go, Python, C/C++, and many other programming languages as well as REST APIs.


Kafka has three main components:

  • Brokers: Brokers are the servers that store and process Kafka messages. A broker is a physical or virtual machine that runs the Kafka software. A broker can store messages from one or more topics.
  • Topics (the streams producers push data to): A topic is a named, logical stream of messages stored on one or more brokers. A topic is like a mailbox that can hold messages: producers publish messages to a topic, and consumers subscribe to the topic to receive them.

A topic has the following characteristics:

  • A topic is append-only: When a new event message is written to a topic, the message is appended to the end of the log.
  • Events in the topic are immutable, meaning they cannot be modified once written.
  • A consumer reads a log by looking for an offset and then reading log entries that follow sequentially.
  • Topics in Kafka are always multi-producer and multi-subscriber: a topic can have zero, one, or many producers that write events to it, as well as zero, one, or many consumers that subscribe to these events.


  • Consumers: Consumers are applications that read messages from topics. Consumers can be implemented in a variety of languages, including Java, Python, and C++.
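To make this concrete, here is a minimal producer/consumer sketch using the third-party kafka-python client. The broker address (localhost:9092), topic name (user-activity), and group id (analytics) are hypothetical values chosen for illustration, not part of any real deployment:

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: publishes one message to the hypothetical "user-activity" topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("user-activity", key=b"user-42", value=b"clicked-search-result")
producer.flush()  # block until all buffered messages have been sent

# Consumer: subscribes to the same topic and reads from the earliest offset.
consumer = KafkaConsumer(
    "user-activity",
    bootstrap_servers="localhost:9092",
    group_id="analytics",          # consumers sharing a group_id split the partitions
    auto_offset_reset="earliest",  # start from the oldest available offset
)
for message in consumer:
    print(message.partition, message.offset, message.key, message.value)
```

Note how each record carries the partition and offset at which it was stored, matching the append-only log model described above.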


Kafka uses a publish-subscribe messaging model (producers push logs to brokers for storage; consumers pull them later), where data is organized into topics and can be partitioned and replicated across multiple brokers. This ensures that data remains available even in the event of a broker failure.

Let's understand the other components more in detail.

Producers: Producers are applications that publish messages to topics. Producers can be implemented in a variety of languages, including Java, Python, and C++.

Partitions: Partitions are a way of distributing messages across multiple brokers. Each partition is stored in its entirety on a single broker (and may be replicated to others). This allows Kafka to scale to handle large volumes of data.

Replicas: Replicas are copies of partitions. Replicas are stored on different brokers. This allows Kafka to be highly available in the event of a broker failure.

Leaders: Leaders are the brokers that are responsible for processing messages for a particular partition. A broker can be the leader for one or more partitions.

Followers: Followers are the brokers that store replicas of partitions. Followers do not serve client reads or writes; they replicate messages from the leader so that they can take over if the leader fails.

Zookeeper: Zookeeper is a distributed coordination service that is used by Kafka to manage brokers, topics, and partitions. Zookeeper is a key-value store that stores information about Kafka's state.

Kafka Connect: Kafka Connect is a framework that allows you to connect Kafka to other systems. Kafka Connect can be used to import data from other systems into Kafka or to export data from Kafka to other systems.
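For instance, connectors are registered through Kafka Connect's REST API (port 8083 by default). The sketch below is hypothetical: it registers a file-source connector, with the connector name, file path, and topic name invented for illustration:

```python
import json
import requests

# Register a connector that streams lines of a file into a Kafka topic.
connector = {
    "name": "demo-file-source",  # hypothetical connector name
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "file": "/tmp/app.log",    # hypothetical input file
        "topic": "user-activity",  # hypothetical destination topic
        "tasks.max": "1",
    },
}
resp = requests.post(
    "http://localhost:8083/connectors",
    data=json.dumps(connector),
    headers={"Content-Type": "application/json"},
)
print(resp.status_code, resp.json())
```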

Kafka Streams: Kafka Streams is a library that allows you to process data in real-time using Kafka. Kafka Streams can be used to filter, aggregate, and transform data.
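Kafka Streams itself is a Java/Scala library, so the Python sketch below is not Kafka Streams; it only illustrates the underlying consume-transform-produce pattern that Kafka Streams wraps in a higher-level DSL (topic names and broker address are again hypothetical):

```python
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("raw-events", bootstrap_servers="localhost:9092",
                         group_id="uppercaser")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Consume from one topic, transform each record, and produce to another:
# the same pattern Kafka Streams expresses declaratively.
for record in consumer:
    transformed = record.value.upper()  # trivial "transform" stage
    producer.send("processed-events", key=record.key, value=transformed)
```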

How does Kafka store the logs?

When a producer publishes a log to Kafka, it is first stored in a partition. A partition is a way of dividing the log into smaller chunks. This allows Kafka to scale to handle large volumes of data.

The partition that a message is stored in is determined by the message's key: Kafka hashes the key to pick a partition, so messages with the same key always land in the same partition. The key is not required to be unique; it is simply an identifier used for routing (a user ID, for example). If no key is specified, the producer spreads messages across partitions (round-robin in older clients, "sticky" batching in newer ones) rather than writing them all to one partition.

[Image: a topic with four partitions (P1–P4) being written to by two producer clients]

This example topic in the image has four partitions P1–P4. Two different producer clients are publishing new events on the topic, independently from each other, by writing events over the network to the topic’s partitions. Events with the same key, which are shown with different colors in the image, are written to the same partition. Note that both producers can write to the same partition if appropriate.
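The sketch below, again using kafka-python against a hypothetical local broker and topic, sends a few keyed messages and prints the partition each one landed in; with the default partitioner, equal keys always hash to the same partition:

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

for key in [b"user-1", b"user-2", b"user-1", b"user-2"]:
    # send() returns a future; .get() blocks until the broker acknowledges
    # the write and returns metadata including the chosen partition.
    metadata = producer.send("user-activity", key=key, value=b"event").get(timeout=10)
    print(key, "-> partition", metadata.partition)
# Both "user-1" messages land in one partition and both "user-2" messages in
# one partition (possibly the same one, since partitioning hashes the key).
```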


Copying logs to multiple brokers (replication)

Kafka can copy each partition of a log to multiple brokers using a process called replication. Replication ensures that the log remains available even if a broker fails.

When a partition is replicated, it is copied to a configurable number of other brokers; this number is called the replication factor.

Because the replicas are stored on different brokers, if one broker fails the log is still available on the others.
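Replication is configured per topic at creation time. Here is a minimal sketch using kafka-python's admin client, assuming a hypothetical cluster of at least three brokers reachable at localhost:9092:

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Four partitions spread the log across brokers for scalability;
# replication_factor=3 keeps three copies of each partition, so the
# topic survives the loss of up to two brokers.
admin.create_topics([
    NewTopic(name="user-activity", num_partitions=4, replication_factor=3)
])
```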


What is ZooKeeper and why does Kafka use it?

ZooKeeper is a distributed coordination service that is often used by distributed systems to maintain configuration information, provide naming and distributed synchronization, and offer group services. It is designed to be highly available, scalable, and fault-tolerant.

Kafka uses ZooKeeper as a distributed coordination service to manage and maintain the Kafka cluster. ZooKeeper is used for various tasks in Kafka, including electing the controller node, storing metadata about topics and partitions, managing consumer offsets, and monitoring Kafka brokers' health.

Here are some specific reasons why Kafka needs ZooKeeper:

  1. Controller election: ZooKeeper is used to elect the controller node in the Kafka cluster. The controller node is responsible for managing the leaders for all partitions in the cluster, among other things. If the current controller fails, ZooKeeper is used to elect a new controller.
  2. Topic and partition metadata: ZooKeeper is used to store metadata about topics and partitions, such as the number of replicas, the location of the replicas, and the current leader for each partition. This information is used by Kafka brokers to determine where to store and retrieve messages.
  3. Consumer group coordination: Kafka consumers can be organized into consumer groups, which allows multiple consumers to work together to consume messages from a topic. In older Kafka versions, ZooKeeper managed consumer group state and tracked the offsets consumed by each group; newer versions store offsets in an internal Kafka topic and coordinate groups through the brokers.
  4. Health monitoring: Kafka brokers register themselves with ZooKeeper, and ZooKeeper is used to monitor the health of the brokers. If a broker fails, ZooKeeper can notify other brokers and trigger a reassignment of partitions.
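As an illustration of point 4, each broker in a ZooKeeper-based cluster registers an ephemeral node under /brokers/ids, which vanishes automatically if the broker dies. The sketch below lists live brokers using the third-party kazoo client, assuming ZooKeeper runs at localhost:2181 (note that recent Kafka versions can also run without ZooKeeper in KRaft mode):

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="localhost:2181")
zk.start()

# Each live broker owns an ephemeral znode under /brokers/ids; its
# disappearance is how the cluster detects broker failures.
for broker_id in zk.get_children("/brokers/ids"):
    data, _stat = zk.get(f"/brokers/ids/{broker_id}")
    print(broker_id, data.decode())  # JSON with host, port, etc.

zk.stop()
```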

How does LinkedIn use Kafka to process events?



At LinkedIn, Kafka is used for various purposes such as logging, data replication, data processing, and data warehousing. Each data center where user-facing services run has a Kafka cluster that receives log data from frontend services in batches; this data is then consumed by online consumers within the same data center.

A separate Kafka cluster is deployed in a geographically close data center for offline analysis, where data-load jobs pull data from the replica Kafka cluster into Hadoop and the data warehouse for reporting and analytical processes. Kafka is also used for prototyping and ad hoc querying.


Manish Joshi

Senior Software Engineer
