Demystifying Kafka
In today's internet world, log data is a critical asset for any sizable internet company. Data generated from user activity events, operational metrics, and system metrics provides valuable insight into user engagement, system utilization, and overall service health.
Traditionally, log data has been used for analytics purposes to track user engagement and system utilization. However, recent trends in Internet applications have made activity data a part of the production data pipeline. This means that log data is being used directly in site features, providing a more personalized and interactive user experience.
One of the key uses of log data is search relevance, where activity data is used to improve the accuracy of search results. Recommendations are another important use of log data, where user activity is analyzed to identify popular items or items that are commonly viewed together, and these insights are used to provide personalized recommendations to users.
The practice of capturing data in real time from event sources such as databases, sensors, mobile devices, cloud services, and software applications, in the form of streams of events (for example, logs generated by a web service or application), is known as event streaming.
Event Streaming
Event streaming is a powerful mechanism that enables companies to manage and process massive amounts of data in real-time. By capturing and processing event streams, companies can store data for later retrieval, manipulate and react to data in real time, and route data to different destinations as needed.
Event streaming is applied to a wide variety of use cases, including the search-relevance and recommendation examples described above.
What is Kafka?
To enable event streaming we need a platform/infrastructure that lets us publish (write) and subscribe to (read) streams of events, continuously import and export data from other systems, store streams of events durably and reliably for as long as we want, and process those streams as they occur or retrospectively. Kafka is one such platform, used by many companies.
Apache Kafka is a distributed streaming platform that was initially developed by engineers at LinkedIn and later open-sourced in 2011. The project is now maintained by the Apache Software Foundation and has become one of the most widely used technologies for handling real-time data streams.
Other messaging systems exist for processing asynchronous data flows, but they are not a good fit for log processing: their feature sets are mismatched, throughput is not their primary design constraint, their distributed support is weak, and their performance degrades when messages are allowed to accumulate.
Specialized log aggregators such as Facebook's Scribe, Yahoo's Data Highway, and Cloudera's Flume have been developed to address the limitations of traditional enterprise messaging systems. However, most of these systems are built for consuming log data offline and often expose unnecessary implementation details to the consumer. Additionally, most of them use a "push" model, which may not be suitable for applications that require a "pull" model.
Considering these existing platforms and their limitations, Kafka was designed differently to overcome them.
Architecture
Broadly there are two major parts of the Kafka ecosystem, servers and clients:
Servers: Kafka is run as a cluster of one or more servers that can span multiple data centers or cloud regions. Some of these servers form the storage layer, called the brokers. Other servers run Kafka Connect to continuously import and export data as event streams to integrate Kafka with your existing systems such as relational databases as well as other Kafka clusters. To let you implement mission-critical use cases, a Kafka cluster is highly scalable and fault-tolerant: if any of its servers fails, the other servers will take over their work to ensure continuous operations without any data loss.
Clients(Producers and consumers): They allow you to write distributed applications and microservices that read, write, and process streams of events in parallel, at scale, and in a fault-tolerant manner even in the case of network problems or machine failures. Kafka ships with some such clients included, which are augmented by dozens of clients provided by the Kafka community: clients are available for Java and Scala including the higher-level Kafka Streams library, for Go, Python, C/C++, and many other programming languages as well as REST APIs.
Kafka has three main components: producers, which write events; brokers, which store them; and consumers, which read them.
Events are organized into topics. A topic is an append-only log of events, ordered within each partition; it is durable, retained for a configurable period, and can be read by many consumers independently.
Kafka uses a publish-subscribe messaging model (producers push logs to brokers for storage; consumers pull them when ready), where data is organized into topics that can be partitioned and replicated across multiple brokers. This ensures that data remains available even if a broker (a server that stores log data and later serves it to consumers) fails.
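To make the pull side concrete, here is a minimal Java consumer sketch that subscribes to a topic and polls for records. The broker address, topic name, and consumer group id are assumptions for illustration, not values from this article.

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ActivityConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "activity-readers");          // hypothetical consumer group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("user-activity"));     // hypothetical topic name
            while (true) {
                // The consumer pulls batches of records at its own pace.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("key=%s value=%s partition=%d offset=%d%n",
                            record.key(), record.value(), record.partition(), record.offset());
                }
            }
        }
    }
}

Because the consumer pulls, it controls its own pace and can re-read older events, which is what distinguishes Kafka from push-based log aggregators.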
Let's understand the other components more in detail.
Producers: Producers are applications that publish messages to topics. Producers can be implemented in a variety of languages, including Java, Python, and C++.
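As a rough illustration, a minimal Java producer might look like the sketch below; the broker address, topic name, key, and payload are all hypothetical.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ActivityProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("user-42") determines the partition; the value is the event payload.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("user-activity", "user-42", "clicked:search-results");
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("Stored in partition %d at offset %d%n",
                            metadata.partition(), metadata.offset());
                }
            });
        } // close() flushes any buffered records before exiting
    }
}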
Partitions: Partitions are a way of distributing messages across multiple brokers. Each partition is stored on a single broker. This allows Kafka to scale to handle large volumes of data.
Replicas: Replicas are copies of partitions. Replicas are stored on different brokers. This allows Kafka to be highly available in the event of a broker failure.
Leaders: Leaders are the brokers that are responsible for processing messages for a particular partition. A broker can be the leader for one or more partitions.
Followers: Followers are the brokers that store replicas of partitions. Followers do not process messages. They only store replicas of messages.
ZooKeeper: ZooKeeper is a distributed coordination service used by Kafka to manage brokers, topics, and partitions. It exposes a small, hierarchical key-value store that holds information about Kafka's state.
Kafka Connect: Kafka Connect is a framework that allows you to connect Kafka to other systems. Kafka Connect can be used to import data from other systems into Kafka or to export data from Kafka to other systems.
Kafka Streams: Kafka Streams is a library that allows you to process data in real-time using Kafka. Kafka Streams can be used to filter, aggregate, and transform data.
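A minimal sketch of a Kafka Streams topology that filters one topic into another is shown below; the application id and topic names are made up for illustration.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class PageViewFilter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pageview-filter");       // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read every event from the input topic, keep only the ones we care about,
        // and write the result to an output topic.
        KStream<String, String> views = builder.stream("page-views");            // hypothetical topic names
        views.filter((userId, page) -> page != null && page.startsWith("/jobs"))
             .to("job-page-views");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}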
How does Kafka store the logs?
When a producer publishes a log to Kafka, it is first stored in a partition. A partition is a way of dividing the log into smaller chunks. This allows Kafka to scale to handle large volumes of data.
The partition a log is written to is determined by the record's key: records with the same key always land in the same partition, which preserves their order. If no key is specified, the producer spreads records across partitions (round-robin in older clients, a "sticky" partitioner in newer ones).
Consider an example topic with four partitions, P1–P4. Two producer clients publish new events to the topic independently, writing events over the network to the topic's partitions. Events with the same key are written to the same partition, and both producers can write to the same partition when appropriate.
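The sketch below illustrates the idea behind key-based partitioning. Kafka's default partitioner hashes the serialized key (with murmur2) and takes the result modulo the partition count; the code uses a simpler stand-in hash so the example stays self-contained, and the keys are hypothetical.

import java.nio.charset.StandardCharsets;

public class PartitionForKey {
    // Illustrative sketch only: same idea as Kafka's default partitioner
    // (hash the key bytes, then take modulo the number of partitions).
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        int hash = 0;
        for (byte b : keyBytes) {
            hash = 31 * hash + b;                   // simple stand-in for murmur2
        }
        return (hash & 0x7fffffff) % numPartitions; // strip the sign bit, then mod
    }

    public static void main(String[] args) {
        int partitions = 4; // matches the P1-P4 example above
        for (String key : new String[]{"user-42", "user-7", "user-42"}) {
            System.out.printf("key=%s -> partition %d%n", key, partitionFor(key, partitions));
        }
        // "user-42" always maps to the same partition, so its events stay ordered.
    }
}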
Replicating logs across brokers
Kafka copies a partition's log to multiple brokers using a process called replication. Replication ensures that the log remains available even if a broker fails.
When a log is replicated, it is copied to a number of other brokers. The number of replicas is configured by the user.
The replicas are stored on different brokers. This means that if one broker fails, the log is still available on the other brokers.
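As a sketch of how the replica count is set, a topic can be created with an explicit replication factor through the Admin client; the broker address and topic name below are assumptions, and a replication factor of 3 requires a cluster with at least three brokers.

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

        try (Admin admin = Admin.create(props)) {
            // 4 partitions, each kept on 3 brokers (1 leader + 2 follower replicas).
            NewTopic topic = new NewTopic("user-activity", 4, (short) 3);        // hypothetical topic name
            admin.createTopics(Collections.singletonList(topic)).all().get();
            System.out.println("Topic created with replication factor 3");
        }
    }
}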
What is ZooKeeper and why does Kafka use it?
ZooKeeper is a distributed coordination service often used by distributed systems to maintain configuration information and naming, and to provide distributed synchronization and group services. It is designed to be highly available, scalable, and fault-tolerant.
Kafka uses ZooKeeper as a distributed coordination service to manage and maintain the Kafka cluster. ZooKeeper is used for various tasks in Kafka, including electing the controller node, storing metadata about topics and partitions, tracking which brokers are alive, and, in older versions, storing consumer offsets.
In short, ZooKeeper supplies the coordination primitives Kafka relies on: controller election, broker membership, and a consistent store for cluster metadata.
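As a small illustration of the cluster metadata this coordination produces, the Admin client can report which broker is currently elected controller and which brokers are members of the cluster; the broker address below is an assumption.

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.DescribeClusterResult;
import org.apache.kafka.common.Node;

import java.util.Properties;
import java.util.concurrent.ExecutionException;

public class ShowController {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

        try (Admin admin = Admin.create(props)) {
            DescribeClusterResult cluster = admin.describeCluster();
            Node controller = cluster.controller().get();   // the broker currently elected as controller
            System.out.printf("Controller: %s (id=%d)%n", controller.host(), controller.id());
            for (Node n : cluster.nodes().get()) {           // current broker membership
                System.out.printf("Broker %d at %s:%d%n", n.id(), n.host(), n.port());
            }
        }
    }
}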
How does LinkedIn use Kafka to process events?
At LinkedIn, Kafka is used for various purposes such as logging, data replication, data processing, and data warehousing. Each data center where user-facing services run has a Kafka cluster that receives log data from frontend services in batches, which is then consumed by online consumers within the same data center.
A separate Kafka cluster is deployed in a geographically close data center for offline analysis, where data load jobs pull data from the replica cluster of Kafka into Hadoop and the data warehouse for reporting and analytical processes. Kafka is also used for prototyping and ad hoc querying.
Manish Joshi
Senior Software Engineer