Demystifying Kafka
In today's internet world, log data is a critical asset for any sizable internet company. Data generated from user activity events, operational metrics, and system metrics provides valuable insight into user engagement, system utilization, and overall service health.
Traditionally, log data has been used for analytics purposes to track user engagement and system utilization. However, recent trends in Internet applications have made activity data a part of the production data pipeline. This means that log data is being used directly in site features, providing a more personalized and interactive user experience.
One of the key uses of log data is search relevance, where activity data is used to improve the accuracy of search results. Recommendations are another important use of log data, where user activity is analyzed to identify popular items or items that are commonly viewed together, and these insights are used to provide personalized recommendations to users.
The practice of capturing data in real time from event sources such as databases, sensors, mobile devices, cloud services, and software applications, in the form of streams of events (for example, logs generated by a web service or application), is known as event streaming.
Event Streaming
Event streaming is a powerful mechanism that enables companies to manage and process massive amounts of data in real-time. By capturing and processing event streams, companies can store data for later retrieval, manipulate and react to data in real time, and route data to different destinations as needed.
Event streaming is applied to a wide variety of use cases, including the search-relevance and recommendation examples described above.
What is Kafka?
To enable event streaming we need a platform/infrastructure that lets us publish (write) and subscribe to (read) streams of events, continuously import and export data from other systems, store streams of events durably and reliably for as long as we want, and process those streams as they occur or retrospectively. Kafka is one such platform, used by many companies.
Apache Kafka is a distributed streaming platform that was initially developed by engineers at LinkedIn and later open-sourced in 2011. The project is now maintained by the Apache Software Foundation and has become one of the most widely used technologies for handling real-time data streams.
Other messaging systems exist for processing asynchronous data flows, but they are not a good fit for log processing: their feature sets are mismatched, throughput is not their primary design constraint, their distributed support is weak, and their performance degrades when messages are allowed to accumulate.
Specialized log aggregators such as Facebook's Scribe, Yahoo's Data Highway, and Cloudera's Flume have been developed to address the limitations of traditional enterprise messaging systems. However, most of these systems are built for consuming log data offline and often expose unnecessary implementation details to the consumer. Additionally, most of them use a "push" model, which may not be suitable for applications that require a "pull" model.
Considering these existing platforms and their limitations, Kafka was designed differently to overcome them.
Architecture
Broadly there are two major parts of the Kafka ecosystem, servers and clients:
Servers: Kafka is run as a cluster of one or more servers that can span multiple data centers or cloud regions. Some of these servers form the storage layer, called the brokers. Other servers run Kafka Connect to continuously import and export data as event streams to integrate Kafka with your existing systems such as relational databases as well as other Kafka clusters. To let you implement mission-critical use cases, a Kafka cluster is highly scalable and fault-tolerant: if any of its servers fails, the other servers will take over their work to ensure continuous operations without any data loss.
Clients(Producers and consumers): They allow you to write distributed applications and microservices that read, write, and process streams of events in parallel, at scale, and in a fault-tolerant manner even in the case of network problems or machine failures. Kafka ships with some such clients included, which are augmented by dozens of clients provided by the Kafka community: clients are available for Java and Scala including the higher-level Kafka Streams library, for Go, Python, C/C++, and many other programming languages as well as REST APIs.
Kafka has three main components: producers, which write events; brokers, which store them; and consumers, which read them.
Events are organized into topics. A topic is an append-only log of events, ordered within each partition; it is durable, retained for a configurable period, and can be read by many consumers independently.
Kafka uses a publish-subscribe messaging model (producers push logs to brokers for storage; consumers pull them when ready), where data is organized into topics that can be partitioned and replicated across multiple brokers. This ensures that data remains available even if a broker (a server that stores log data and later serves it to consumers) fails.
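To make the pull side concrete, here is a minimal Java consumer sketch that subscribes to a topic and polls for records. The broker address, topic name, and consumer group id are assumptions for illustration, not values from this article.

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ActivityConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "activity-readers");          // hypothetical consumer group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("user-activity"));     // hypothetical topic name
            while (true) {
                // The consumer pulls batches of records at its own pace.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("key=%s value=%s partition=%d offset=%d%n",
                            record.key(), record.value(), record.partition(), record.offset());
                }
            }
        }
    }
}

Because the consumer pulls, it controls its own pace and can re-read older events, which is what distinguishes Kafka from push-based log aggregators.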
Let's understand the other components more in detail.
Producers: Producers are applications that publish messages to topics. Producers can be implemented in a variety of languages, including Java, Python, and C++.
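As a rough illustration, a minimal Java producer might look like the sketch below; the broker address, topic name, key, and payload are all hypothetical.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ActivityProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("user-42") determines the partition; the value is the event payload.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("user-activity", "user-42", "clicked:search-results");
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("Stored in partition %d at offset %d%n",
                            metadata.partition(), metadata.offset());
                }
            });
        } // close() flushes any buffered records before exiting
    }
}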
Partitions: Partitions are a way of distributing messages across multiple brokers. Each partition is stored on a single broker. This allows Kafka to scale to handle large volumes of data.
Replicas: Replicas are copies of partitions. Replicas are stored on different brokers. This allows Kafka to be highly available in the event of a broker failure.
Leaders: Leaders are the brokers that are responsible for processing messages for a particular partition. A broker can be the leader for one or more partitions.
Followers: Followers are the brokers that store replicas of partitions. Followers do not process messages. They only store replicas of messages.
ZooKeeper: ZooKeeper is a distributed coordination service used by Kafka to manage brokers, topics, and partitions. It exposes a small, hierarchical key-value store that holds information about Kafka's state.
Kafka Connect: Kafka Connect is a framework that allows you to connect Kafka to other systems. Kafka Connect can be used to import data from other systems into Kafka or to export data from Kafka to other systems.
Kafka Streams: Kafka Streams is a library that allows you to process data in real-time using Kafka. Kafka Streams can be used to filter, aggregate, and transform data.
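A minimal sketch of a Kafka Streams topology that filters one topic into another is shown below; the application id and topic names are made up for illustration.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class PageViewFilter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pageview-filter");       // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read every event from the input topic, keep only the ones we care about,
        // and write the result to an output topic.
        KStream<String, String> views = builder.stream("page-views");            // hypothetical topic names
        views.filter((userId, page) -> page != null && page.startsWith("/jobs"))
             .to("job-page-views");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}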
How does Kafka store the logs?
When a producer publishes a log to Kafka, it is first stored in a partition. A partition is a way of dividing the log into smaller chunks. This allows Kafka to scale to handle large volumes of data.
The partition a log is written to is determined by the record's key: records with the same key always land in the same partition, which preserves their order. If no key is specified, the producer spreads records across partitions (round-robin in older clients, a "sticky" partitioner in newer ones).
Consider an example topic with four partitions, P1–P4. Two producer clients publish new events to the topic independently, writing events over the network to the topic's partitions. Events with the same key are written to the same partition, and both producers can write to the same partition when appropriate.
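The sketch below illustrates the idea behind key-based partitioning. Kafka's default partitioner hashes the serialized key (with murmur2) and takes the result modulo the partition count; the code uses a simpler stand-in hash so the example stays self-contained, and the keys are hypothetical.

import java.nio.charset.StandardCharsets;

public class PartitionForKey {
    // Illustrative sketch only: same idea as Kafka's default partitioner
    // (hash the key bytes, then take modulo the number of partitions).
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        int hash = 0;
        for (byte b : keyBytes) {
            hash = 31 * hash + b;                   // simple stand-in for murmur2
        }
        return (hash & 0x7fffffff) % numPartitions; // strip the sign bit, then mod
    }

    public static void main(String[] args) {
        int partitions = 4; // matches the P1-P4 example above
        for (String key : new String[]{"user-42", "user-7", "user-42"}) {
            System.out.printf("key=%s -> partition %d%n", key, partitionFor(key, partitions));
        }
        // "user-42" always maps to the same partition, so its events stay ordered.
    }
}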
Replicating logs across brokers
Kafka copies a partition's log to multiple brokers using a process called replication. Replication ensures that the log remains available even if a broker fails.
When a log is replicated, it is copied to a number of other brokers. The number of replicas is configured by the user.
The replicas are stored on different brokers. This means that if one broker fails, the log is still available on the other brokers.
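As a sketch of how the replica count is set, a topic can be created with an explicit replication factor through the Admin client; the broker address and topic name below are assumptions, and a replication factor of 3 requires a cluster with at least three brokers.

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

        try (Admin admin = Admin.create(props)) {
            // 4 partitions, each kept on 3 brokers (1 leader + 2 follower replicas).
            NewTopic topic = new NewTopic("user-activity", 4, (short) 3);        // hypothetical topic name
            admin.createTopics(Collections.singletonList(topic)).all().get();
            System.out.println("Topic created with replication factor 3");
        }
    }
}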
What is ZooKeeper and why does Kafka use it?
ZooKeeper is a distributed coordination service often used by distributed systems to maintain configuration information and naming, and to provide distributed synchronization and group services. It is designed to be highly available, scalable, and fault-tolerant.
Kafka uses ZooKeeper as a distributed coordination service to manage and maintain the Kafka cluster. ZooKeeper is used for various tasks in Kafka, including electing the controller node, storing metadata about topics and partitions, tracking which brokers are alive, and, in older versions, storing consumer offsets.
In short, ZooKeeper supplies the coordination primitives Kafka relies on: controller election, broker membership, and a consistent store for cluster metadata.
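As a small illustration of the cluster metadata this coordination produces, the Admin client can report which broker is currently elected controller and which brokers are members of the cluster; the broker address below is an assumption.

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.DescribeClusterResult;
import org.apache.kafka.common.Node;

import java.util.Properties;
import java.util.concurrent.ExecutionException;

public class ShowController {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

        try (Admin admin = Admin.create(props)) {
            DescribeClusterResult cluster = admin.describeCluster();
            Node controller = cluster.controller().get();   // the broker currently elected as controller
            System.out.printf("Controller: %s (id=%d)%n", controller.host(), controller.id());
            for (Node n : cluster.nodes().get()) {           // current broker membership
                System.out.printf("Broker %d at %s:%d%n", n.id(), n.host(), n.port());
            }
        }
    }
}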
How does LinkedIn use Kafka to process events?
At LinkedIn, Kafka is used for various purposes such as logging, data replication, data processing, and data warehousing. Each data center where user-facing services run has a Kafka cluster that receives log data from frontend services in batches, which is then consumed by online consumers within the same data center.
A separate Kafka cluster is deployed in a geographically close data center for offline analysis, where data load jobs pull data from the replica cluster of Kafka into Hadoop and the data warehouse for reporting and analytical processes. Kafka is also used for prototyping and ad hoc querying.
Manish Joshi
Senior Software Engineer