ZERO to HERO in 5 minutes in Apache KAFKA

Learning Outcomes from this article:

  • Introduction to Kafka via a Bangalore Traffic Analogy
  • Producers, Consumers, Brokers, Partitions
  • Consumer Groups, Replication, Leader Election (KRaft, Zookeeper)

Make sure NOT to read this article FULLY if you want to screw up your next HLD interview, which might require concepts of Kafka :)

Introduction to Kafka via a Bangalore Traffic Analogy

Imagine you're an engineer @Noober Technologies and have to design a system to track the functioning of Cabs in various parts of Bangalore's traffic hell.

Such a system would need to update information regarding the following:

  • Availability of Cabs in a location
  • Status of Traffic in a location
  • Weather details of a particular location, for pulling money from your pockets
  • Analytics of people dying trying to book a cab

Producers and Consumers in Kafka:

In our architecture, we could have a process that reads these data points and pushes them into a Queue data structure.

Since this process produces data into the queue, let's call it a Producer. Very. Smart. Stuff :P


Once this data is pushed into the queue, other processes read these data points from the queue and present them in the form of a UI to frustrated people (trying to book a cab in Blr) or to internal Noober employees for analytics purposes.


We call these processes Consumers, since they consume the data points from the queue. Again, very smart.
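The producer/consumer flow above can be sketched with a toy in-memory queue (plain Python, no real Kafka client - the record fields are made up for illustration):

```python
from collections import deque

# A toy stand-in for our central queue: a simple FIFO.
queue = deque()

def produce(record):
    """Producer: push a data point onto the queue."""
    queue.append(record)

def consume():
    """Consumer: pop the oldest data point off the queue, if any."""
    return queue.popleft() if queue else None

# The producer reads data points and publishes them.
produce({"type": "cab_availability", "location": "Sarjapur", "free_cabs": 3})
produce({"type": "traffic_status", "location": "Sarjapur", "status": "jammed"})

# A consumer (say, the Noober app UI) reads them in order.
first = consume()
second = consume()
print(first["type"], second["type"])   # cab_availability traffic_status
```

Real Kafka producers and consumers do much more (batching, acks, rebalancing), but the read-from-one-end, write-to-the-other shape is the same.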

Partitions, Brokers, Topics, Offsets in Kafka:

Here comes the inflection point in our architecture. Over time, our business grows and we start covering more and more features in the same Noober application.

The following 2 issues arise:

  • Our central queue, running on 1 server, becomes overloaded with damn too many requests and begins to choke.
  • Producers and Consumers also start suffering, since they need to handle sudden failures of the Queue.

At this stage, we would want to distribute our Queue data structure over multiple servers. But how do we do that?

One way could be to distribute the contents of the queue into multiple queues randomly.


An important problem with randomly distributing the data into multiple queues:

Suppose you want analytics for Sarjapur Cab Traffic.

Me, who could never book a cab in the evening from Sarjapur, talking about Analytics.

Then, as a consumer process, you'll have to consume the traffic status from queue-1 and the cab trip details from queue-2.

After which, you'll have to write business logic to merge these results to provide a good customer experience, since the data is coming from different queues.

This is unnecessary EXTRA work, and Kafka's Partition Key feature helps avoid this issue.

The Partition Key feature in Kafka ensures that all data points whose key has a particular hash value enter the SAME queue / partition.

Hence, all messages relevant to the Sarjapur location would enter the SAME queue, and so there is no need to merge results at the consumer end :)

Partitions are a way to group data based on some logical parameter. In our case, it was the hash of the location - Sarjapur.

  • In Kafka, each of these queues is called a Partition.
  • Each server that runs one or more of these queues / partitions is called a Broker.
  • A group of these queues / partitions that contains similar data is called a Topic.
  • Each record in a queue / partition is identified using a Sequence Number, or an Offset.
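A minimal sketch of the partition-key idea. Note the assumptions: real Kafka's default partitioner uses a murmur2 hash of the key; here I use Python's `zlib.crc32` just to illustrate that the same key always maps to the same partition, and each partition is modelled as an append-only list whose indices serve as offsets.

```python
import zlib

NUM_PARTITIONS = 3
# Each "partition" is an append-only log; a record's offset is its index.
partitions = [[] for _ in range(NUM_PARTITIONS)]

def partition_for(key: str) -> int:
    # Same key -> same hash -> same partition, every single time.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def produce(key: str, value: str) -> int:
    p = partition_for(key)
    partitions[p].append(value)
    return len(partitions[p]) - 1   # the record's offset within the partition

# All Sarjapur messages land in the SAME partition, with sequential offsets.
o1 = produce("Sarjapur", "traffic: jammed")
o2 = produce("Sarjapur", "cab trip: completed")
print(o1, o2)   # 0 1
```

Because every "Sarjapur" record lands in one partition, a single consumer sees them all, in order - no merging needed.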


Consumer Groups, Replication, Tiered Storage in Kafka:

Now here comes the parallelism and concurrency bit in Kafka.

Since we have distributed the data points into multiple queues in our system, we could run multiple consumers reading the data from different queues.

An important point regarding Concurrency and Parallelism

The unit of parallelism in Kafka is the number of Partitions. Hence, if you have 10 partitions in a topic, then you can have up to 10 consumer processes reading data from these 10 partitions concurrently.

Another important concept is that of a Consumer Group.

Imagine that these 2 personas:

  • Customers trying to book a Cab
  • Internal Employees running analytics on data

are trying to consume the SAME data from the SAME set of queues/partitions.

Kafka provides the functionality of putting all consumer processes of Customers into 1 Consumer Group, and all consumer processes of Internal Employees into another Consumer Group.

(Image: two consumer groups - red and blue - reading the same partitions, each with its own offset pointer.)

2 consumer processes in 2 different consumer groups can each have their own tracking of offsets. The red circles are consumers from the Red Consumer Group, and likewise for the blue ones.

For example: the red arrow signifies up to what position in the queue a red consumer has read the messages. And the red pointer is completely independent of the blue pointer.

So the red consumers and blue consumers can consume the same messages at their own respective paces, making Kafka highly scalable in terms of consuming messages as well.

This depicts how easily Kafka solves the concurrency and parallelism problem using Consumer Groups and the associated offset-tracking system.

Important:

Any 2 consumers belonging to a single Consumer Group NEVER read from the SAME partition. Otherwise, we'd have to apply locks etc., and that would prevent Kafka from being a HIGH-THROUGHPUT system.
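The offset bookkeeping above can be sketched as a toy model (NOT the real Kafka group protocol - just the core idea that each group keeps its own offset per partition, so groups never interfere with each other):

```python
# One topic with 2 partitions; each partition is an append-only log.
partitions = {0: ["m0", "m1", "m2"], 1: ["n0", "n1"]}

# Each consumer group tracks its own offset per partition, independently.
offsets = {
    "customers": {0: 0, 1: 0},   # the "red" group
    "analytics": {0: 0, 1: 0},   # the "blue" group
}

def poll(group: str, partition: int):
    """Read the next record for `group` from `partition`, advancing its offset."""
    off = offsets[group][partition]
    if off >= len(partitions[partition]):
        return None   # caught up; nothing new to read
    offsets[group][partition] = off + 1
    return partitions[partition][off]

# Both groups read the SAME messages at their own pace.
a = poll("customers", 0)   # "m0"
b = poll("customers", 0)   # "m1"
c = poll("analytics", 0)   # also "m0" - analytics' pointer is independent
print(a, b, c)
```

Within a real group, Kafka additionally assigns each partition to exactly one consumer, which is why no locks are needed on the partition itself.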

Another capability of Kafka is to ensure that the data residing in the queues can be deleted after a configured retention period.

Confluent introduced a feature called Tiered Storage that lets you persist data in Kafka across different storage tiers - cheap stores (blob stores) and more expensive, faster stores (append-only logs on local broker disks) - depending on how long data has resided in the system and its associated access pattern.
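As a rough sketch, retention is configured per broker or per topic. The retention properties below are standard Kafka configs; the tiered-storage ones are Confluent-Platform-specific and quoted from memory, so treat them as illustrative rather than authoritative:

```properties
# Topic-level: delete data from this topic after 7 days
retention.ms=604800000

# Broker-wide default retention
log.retention.hours=168

# Confluent Tiered Storage (Confluent Platform; illustrative)
confluent.tier.feature=true
confluent.tier.enable=true
```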


Anyway, finally - the data inside each partition is replicated across multiple brokers.

So Kafka keeps multiple copies of the same data on different machines, to ensure the fault tolerance and durability of the system.

A replication factor of 3 implies Kafka would keep 3 copies of the same data on 3 different brokers.
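With the stock Kafka CLI, creating a topic with a replication factor of 3 looks roughly like this (a sketch assuming a broker reachable at localhost:9092; the topic name is made up):

```shell
bin/kafka-topics.sh --create \
  --topic noober-cab-updates \
  --partitions 10 \
  --replication-factor 3 \
  --bootstrap-server localhost:9092
```

Each of the 10 partitions would then live on 3 different brokers, so losing one machine loses no data.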

Leader Election in Kafka (KRaft, Zookeeper):

Since there are multiple brokers that store the same copy of a queue, at any point in time there is 1 leader broker (the one which is most up-to-date with respect to the latest writes to the system).

The other brokers are categorised as followers. In situations where the existing leader fails, one of the followers is promoted to be the leader, and leader election algorithms are put in place for the same.
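A toy sketch of the failover intuition (this is NOT the actual KRaft or Zookeeper algorithm - just the idea of promoting the most up-to-date surviving replica; broker names and offsets are invented):

```python
# Each replica of a partition, with how far its copy of the log has progressed.
replicas = {
    "broker-1": {"alive": True, "log_end_offset": 120},   # current leader
    "broker-2": {"alive": True, "log_end_offset": 120},   # fully caught-up follower
    "broker-3": {"alive": True, "log_end_offset": 95},    # lagging follower
}
leader = "broker-1"

def elect_new_leader(replicas, failed):
    """Mark the failed leader dead and promote the most caught-up survivor."""
    replicas[failed]["alive"] = False
    survivors = {b: r for b, r in replicas.items() if r["alive"]}
    return max(survivors, key=lambda b: survivors[b]["log_end_offset"])

# broker-1 dies; broker-2 (fully caught up) wins over the lagging broker-3.
leader = elect_new_leader(replicas, "broker-1")
print(leader)   # broker-2
```

Real Kafka restricts candidates to the in-sync replica set and coordinates the decision through the controller, but the "most up-to-date survivor wins" intuition carries over.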

There is a very interesting protocol called KRaft that solves this problem. Earlier, this job was handled by Zookeeper. I'll try to document the leader election algorithm via both these ways in future articles.

Thank you very much for making it this far. Hope you learnt a thing or two about Apache Kafka, if you're a beginner :)

Please comment if you have any questions or found something incorrect !!

Godspeed !!
