ZERO to HERO in 5 minutes in Apache KAFKA

Learning Outcomes from this article:

  • Introduction to Kafka via a Bangalore Traffic Analogy
  • Producers, Consumers, Brokers, Partitions
  • Consumer Groups, Replication, Leader Election (KRaft, Zookeeper)

Make sure NOT to read this article FULLY if you want to screw up your next HLD interview, which might require concepts of Kafka :)

Introduction to Kafka via a Bangalore Traffic Analogy

Imagine you're an engineer @Noober Technologies and have to design a system to track the functioning of Cabs in various parts of Bangalore's traffic hell.

Such a system would need to update information regarding the following:

  • Availability of Cabs in a location
  • Status of Traffic in a location
  • Weather details of a particular location, for pulling money from your pockets
  • Analytics of people dying trying to book a cab

Producers and Consumers in Kafka:

In our architecture, we could have a process that reads these data points and pushes them into a Queue data structure.

Since this process produces data into the queue, let's call it a Producer. Very. Smart. Stuff :P


Once this data is pushed into the queue, other processes read these data points from the queue and present them in the form of a UI to frustrated people (trying to book a cab in Blr) or to internal Noober employees for analytics purposes.


We call these processes Consumers, since they consume the data points from the queue. Again, very smart.
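The producer/consumer flow above can be sketched with a toy in-memory queue (plain Python, no real Kafka client - the record fields are made up for illustration):

```python
from collections import deque

# A toy stand-in for our central queue: a simple FIFO.
queue = deque()

def produce(record):
    """Producer: push a data point onto the queue."""
    queue.append(record)

def consume():
    """Consumer: pop the oldest data point off the queue, if any."""
    return queue.popleft() if queue else None

# The producer reads data points and publishes them.
produce({"type": "cab_availability", "location": "Sarjapur", "free_cabs": 3})
produce({"type": "traffic_status", "location": "Sarjapur", "status": "jammed"})

# A consumer (say, the Noober app UI) reads them in order.
first = consume()
second = consume()
print(first["type"], second["type"])   # cab_availability traffic_status
```

Real Kafka producers and consumers do much more (batching, acks, rebalancing), but the read-from-one-end, write-to-the-other shape is the same.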

Partitions, Brokers, Topics, Offsets in Kafka:

Here comes the inflection point in our architecture. Over time, our business grows and we start covering more and more features in the same Noober application.

The following 2 issues arise:

  • Our central queue, running on 1 server, becomes overloaded with damn too many requests and begins to choke.
  • Producers and Consumers also start suffering, since they need to handle sudden failures of the Queue.

At this stage, we would want to distribute our Queue data structure over multiple servers. But how do we do that?

One way could be to distribute the contents of the queue into multiple queues randomly.


An important problem with randomly distributing the data into multiple queues:

Suppose you want analytics for Sarjapur Cab Traffic.

Me, who could never book a cab in the evening from Sarjapur, talking about Analytics.

Then, as a consumer process, you'll have to consume the traffic status from queue-1 and the cab trip details from queue-2.

After which, you'll have to write business logic to merge these results to provide a good customer experience, since the data is coming from different queues.

This is unnecessary EXTRA work, and Kafka's Partition Key feature helps avoid this issue.

The Partition Key feature in Kafka ensures that all data points whose key has a particular hash value enter the SAME queue / partition.

Hence, all messages relevant to the Sarjapur location would enter the SAME queue, and so there is no need to merge results at the consumer end :)

Partitions are a way to group data based on some logical parameter. In our case, it was the hash of the location - Sarjapur.

  • In Kafka, each of these queues is called a Partition.
  • Each server that runs one or more of these queues / partitions is called a Broker.
  • A group of these queues / partitions that contains similar data is called a Topic.
  • Each record in a queue / partition is identified using a Sequence Number, or an Offset.
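A minimal sketch of the partition-key idea. Note the assumptions: real Kafka's default partitioner uses a murmur2 hash of the key; here I use Python's `zlib.crc32` just to illustrate that the same key always maps to the same partition, and each partition is modelled as an append-only list whose indices serve as offsets.

```python
import zlib

NUM_PARTITIONS = 3
# Each "partition" is an append-only log; a record's offset is its index.
partitions = [[] for _ in range(NUM_PARTITIONS)]

def partition_for(key: str) -> int:
    # Same key -> same hash -> same partition, every single time.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def produce(key: str, value: str) -> int:
    p = partition_for(key)
    partitions[p].append(value)
    return len(partitions[p]) - 1   # the record's offset within the partition

# All Sarjapur messages land in the SAME partition, with sequential offsets.
o1 = produce("Sarjapur", "traffic: jammed")
o2 = produce("Sarjapur", "cab trip: completed")
print(o1, o2)   # 0 1
```

Because every "Sarjapur" record lands in one partition, a single consumer sees them all, in order - no merging needed.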


Consumer Groups, Replication, Tiered Storage in Kafka:

Now here comes the parallelism and concurrency bit in Kafka.

Since we have distributed the data points into multiple queues in our system, we could run multiple consumers reading the data from different queues.

An important point regarding Concurrency and Parallelism

The unit of parallelism in Kafka is the number of Partitions. Hence, if you have 10 partitions in a topic, then you can have up to 10 consumer processes reading data from these 10 partitions concurrently.

Another important concept is that of a Consumer Group.

Imagine that these 2 personas:

  • Customers trying to book a Cab
  • Internal Employees running analytics on data

are trying to consume the SAME data from the SAME set of queues/partitions.

Kafka provides the functionality of putting all consumer processes of Customers into 1 Consumer Group, and all consumer processes of Internal Employees into another Consumer Group.

(Image: two consumer groups - red and blue - reading the same partitions, each with its own offset pointer.)

2 consumer processes in 2 different consumer groups can each have their own tracking of offsets. The red circles are consumers from the Red Consumer Group, and likewise for the blue ones.

For example: the red arrow signifies up to what position in the queue a red consumer has read the messages. And the red pointer is completely independent of the blue pointer.

So the red consumers and blue consumers can consume the same messages at their own respective paces, making Kafka highly scalable in terms of consuming messages as well.

This depicts how easily Kafka solves the concurrency and parallelism problem using Consumer Groups and the associated offset-tracking system.

Important:

Any 2 consumers belonging to a single Consumer Group NEVER read from the SAME partition. Otherwise, we'd have to apply locks etc., and that would prevent Kafka from being a HIGH-THROUGHPUT system.
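The offset bookkeeping above can be sketched as a toy model (NOT the real Kafka group protocol - just the core idea that each group keeps its own offset per partition, so groups never interfere with each other):

```python
# One topic with 2 partitions; each partition is an append-only log.
partitions = {0: ["m0", "m1", "m2"], 1: ["n0", "n1"]}

# Each consumer group tracks its own offset per partition, independently.
offsets = {
    "customers": {0: 0, 1: 0},   # the "red" group
    "analytics": {0: 0, 1: 0},   # the "blue" group
}

def poll(group: str, partition: int):
    """Read the next record for `group` from `partition`, advancing its offset."""
    off = offsets[group][partition]
    if off >= len(partitions[partition]):
        return None   # caught up; nothing new to read
    offsets[group][partition] = off + 1
    return partitions[partition][off]

# Both groups read the SAME messages at their own pace.
a = poll("customers", 0)   # "m0"
b = poll("customers", 0)   # "m1"
c = poll("analytics", 0)   # also "m0" - analytics' pointer is independent
print(a, b, c)
```

Within a real group, Kafka additionally assigns each partition to exactly one consumer, which is why no locks are needed on the partition itself.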

Another capability of Kafka is to ensure that the data residing in the queues can be deleted after a configured retention period.

Confluent introduced a feature called Tiered Storage that lets you persist data in Kafka across different storage tiers - cheap stores (blob stores) and more expensive, faster stores (append-only logs on local broker disks) - depending on how long data has resided in the system and its associated access pattern.
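As a rough sketch, retention is configured per broker or per topic. The retention properties below are standard Kafka configs; the tiered-storage ones are Confluent-Platform-specific and quoted from memory, so treat them as illustrative rather than authoritative:

```properties
# Topic-level: delete data from this topic after 7 days
retention.ms=604800000

# Broker-wide default retention
log.retention.hours=168

# Confluent Tiered Storage (Confluent Platform; illustrative)
confluent.tier.feature=true
confluent.tier.enable=true
```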


Anyway, finally - the data inside each partition is replicated across multiple brokers.

So Kafka keeps multiple copies of the same data on different machines, to ensure the fault tolerance and durability of the system.

A replication factor of 3 implies Kafka would keep 3 copies of the same data on 3 different brokers.
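With the stock Kafka CLI, creating a topic with a replication factor of 3 looks roughly like this (a sketch assuming a broker reachable at localhost:9092; the topic name is made up):

```shell
bin/kafka-topics.sh --create \
  --topic noober-cab-updates \
  --partitions 10 \
  --replication-factor 3 \
  --bootstrap-server localhost:9092
```

Each of the 10 partitions would then live on 3 different brokers, so losing one machine loses no data.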

Leader Election in Kafka (KRaft, Zookeeper):

Since there are multiple brokers that store the same copy of a queue, at any point in time there is 1 leader broker (the one which is most up-to-date with respect to the latest writes to the system).

The other brokers are categorised as followers. In situations where the existing leader fails, one of the followers is promoted to be the leader, and leader election algorithms are put in place for the same.
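A toy sketch of the failover intuition (this is NOT the actual KRaft or Zookeeper algorithm - just the idea of promoting the most up-to-date surviving replica; broker names and offsets are invented):

```python
# Each replica of a partition, with how far its copy of the log has progressed.
replicas = {
    "broker-1": {"alive": True, "log_end_offset": 120},   # current leader
    "broker-2": {"alive": True, "log_end_offset": 120},   # fully caught-up follower
    "broker-3": {"alive": True, "log_end_offset": 95},    # lagging follower
}
leader = "broker-1"

def elect_new_leader(replicas, failed):
    """Mark the failed leader dead and promote the most caught-up survivor."""
    replicas[failed]["alive"] = False
    survivors = {b: r for b, r in replicas.items() if r["alive"]}
    return max(survivors, key=lambda b: survivors[b]["log_end_offset"])

# broker-1 dies; broker-2 (fully caught up) wins over the lagging broker-3.
leader = elect_new_leader(replicas, "broker-1")
print(leader)   # broker-2
```

Real Kafka restricts candidates to the in-sync replica set and coordinates the decision through the controller, but the "most up-to-date survivor wins" intuition carries over.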

There is a very interesting protocol called KRaft that solves this problem. Earlier, this job was handled by Zookeeper. I'll try to document the leader election algorithm via both these ways in future articles.

Thank you very much for making it this far. Hope you learnt a thing or two about Apache Kafka, if you're a beginner :)

Please comment if you have any questions or found something incorrect !!

Godspeed !!
