Learn Kafka In Just 5 minutes
Shrey Batra
CEO @ Cosmocloud | Ex-LinkedIn | Angel Investor | MongoDB Champion | Book Author | Patent Holder (Distributed Algorithms)
The topic of Apache Kafka is hot these days..! But why is that? What is Kafka, where do we use it, and what are its benefits? Let's look at all this in a short 5 min article..! If you like my articles, do subscribe and share my newsletter with your friends..!
Yes, Kafka was originally made by LinkedIn, later open sourced as Apache Kafka.
Apache Kafka - The Elon Musk of Streaming Systems
Yup, you heard it right..! Apache Kafka is an open source, distributed "event streaming" platform (or datastore) used in high-throughput systems. Let's break this down in easy terms -
We have all heard about a queue in our Algo/DS lectures: a long, array-like pipe where we put data in at one end and it pops out at the other end, in the order we pushed it. Simple FIFO - first in, first out.
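To make the FIFO idea concrete, here is a tiny sketch using Python's standard `collections.deque`. It is only an in-memory queue for illustration, not Kafka itself.

```python
from collections import deque

# A plain in-memory FIFO queue: items leave in the order they entered.
queue = deque()
for event in [1, 2, 3]:
    queue.append(event)          # push at one end

# Pop from the other end, oldest first.
popped = [queue.popleft() for _ in range(3)]

print(popped)       # [1, 2, 3]
print(len(queue))   # 0, because reading removed the data
```

Notice that once the data is popped, it is gone from the queue. Keep that in mind for what follows.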
Don't tell me it's that simple..!
Similarly, let's say we have a Kafka topic - a queue-like kind of datastore, into which we can insert data in an order (1, 2, 3...) and then read that data back in the same order (1, 2, 3...). Let's assume a single partition, a very basic configuration, just to start understanding.
Now, how is Kafka different from a simple queue? Kafka persists the data in its "queue-like" structure, which means it actually stores it on disk and not just in RAM, so events survive a restart instead of vanishing with the process.
Why is this Queue like Datastore important?
Now that we know the (very, very) basic working of Kafka, we can say that a client can query or read data from Kafka - it's a datastore, right? You might think that you could fire some SQL-like query and apply filters, but no. Let's come back to the basics of a queue system - you read the data one by one, in FIFO manner.
The difference between an ordinary queue and Kafka is that Kafka does not remove the data after you have read it. Instead, each consumer (or client) reading from Kafka maintains an offset that says how much data it has read and where it should start reading next.
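The offset idea can be sketched in a few lines of Python. `TopicLog` and `Consumer` are hypothetical names made up for this toy model; they are not part of any Kafka client library.

```python
class TopicLog:
    """Toy single-partition topic: an append-only list (not real Kafka)."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1      # offset of the event just written

    def read(self, offset, max_events=10):
        # Reading never deletes anything; it only slices the log.
        return self._events[offset:offset + max_events]


class Consumer:
    """Remembers how far it has read, so it can resume from there."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_events=10):
        batch = self.log.read(self.offset, max_events)
        self.offset += len(batch)         # advance past what we just read
        return batch


log = TopicLog()
for e in ["a", "b", "c"]:
    log.append(e)

consumer = Consumer(log)
first = consumer.poll(max_events=2)    # ['a', 'b']
second = consumer.poll(max_events=2)   # ['c']
```

After both polls the consumer's offset is 3, yet `log.read(0)` still returns all three events, because nothing was removed.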
Let's ignore "queue like" terminology and replace it with WAL - Write Ahead Log - meaning we write each event in an append only file.
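As a rough illustration of "an append-only file", the sketch below writes one event per line and reads back from a given line offset. The file name, event names, and helper functions are assumptions for this example. Real Kafka stores its log in a more sophisticated on-disk format.

```python
import os
import tempfile


def append_event(path, event):
    # Mode "a" only ever adds to the end of the file; nothing is overwritten.
    with open(path, "a", encoding="utf-8") as f:
        f.write(event + "\n")


def read_from(path, offset):
    # Return every event from line number `offset` onwards (0-based).
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    return lines[offset:]


log_path = os.path.join(tempfile.mkdtemp(), "topic.log")
for event in ["order-created", "order-paid", "order-shipped"]:
    append_event(log_path, event)

replayed = read_from(log_path, 1)
print(replayed)   # ['order-paid', 'order-shipped']
```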
Multiple Clients on same Kafka Topic
Now, as we have data persistently stored in our Kafka Topic (or the queue-like structure), multiple consumers (or clients) can read data from this topic, each maintaining its own offset. This is one of the most powerful features of Kafka -
Using Kafka, you can push a single event to multiple services, as each service can read the same Kafka Topic.
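Here is a minimal sketch of that fan-out, assuming two hypothetical services (`email-service` and `analytics-service`) reading the same in-memory topic at their own pace. It only models the idea and does not use a Kafka client.

```python
# The shared topic: every service sees the same events.
events = ["signup:alice", "signup:bob", "signup:carol"]

# One independent offset per service (service names are made up).
offsets = {"email-service": 0, "analytics-service": 0}


def poll(service, max_events):
    """Return the next batch for `service` and advance only its own offset."""
    start = offsets[service]
    batch = events[start:start + max_events]
    offsets[service] = start + len(batch)
    return batch


email_batch = poll("email-service", 3)          # reads everything
analytics_batch = poll("analytics-service", 1)  # is slower, reads one event

print(offsets)   # {'email-service': 3, 'analytics-service': 1}
```

The slow service does not hold back the fast one, and the fast one does not "steal" events from the slow one.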
Kafka as a Distributed System
Simply put, how do you define a distributed system? And why? We call a system distributed when we can break our datastore / system into smaller, similar-looking "chunks" or partitions, helping us scale the system for huge volumes of data or requests.
If you look at our basic Kafka design, we maintain a single WAL store, which could simply be an append-only file (writing each event/message on a new line of the file). Now, let's say your producer - the service pushing events into the system - produces 50 events every second, whereas your consumer - the service that reads the events and does computation - reads only 10 events per second.
As time goes on, your consumer will lag further and further behind, and there will be a huge delay between when an event is pushed to Kafka and when it is read. Hence, what if we could have multiple consumers reading from the Kafka queue, scaling our system? But how do we know that one consumer has read a message, so that another consumer can read the next? Remember the FIFO logic..!!
To solve this, we break our single WAL (or queue-like) system into multiple WALs (or queues) and distribute our events across these new partitions (smaller queues). Each partition now behaves like a FIFO queue or WAL, and we can have as many consumers reading from Kafka as there are partitions..! (At most 1 consumer per partition within a consumer group, i.e. the set of consumers sharing the work for one service.) This lets us break our original Kafka topic (50 messages per second) into 5 partitions (each receiving 10 msg/sec) with 5 consumers (each reading 10 msg/sec) from their own partition.
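The lag in this example is simple arithmetic. The helper below is a hypothetical function written for this article, assuming constant rates and an empty topic at the start.

```python
def consumer_lag(produce_rate, consume_rate, seconds):
    """Unread events after `seconds`, with constant rates and an empty start."""
    return max(0, (produce_rate - consume_rate) * seconds)


produce_rate = 50   # events written per second
consume_rate = 10   # events read per second

lag_after_minute = consumer_lag(produce_rate, consume_rate, 60)

# Time the single consumer needs just to clear that backlog,
# if no new events arrived in the meantime.
catch_up_seconds = lag_after_minute / consume_rate

print(lag_after_minute)   # 2400
print(catch_up_seconds)   # 240.0
```

So after only one minute, a single consumer is already 2400 events behind, and the gap keeps growing by 40 events every second.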
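Here is a small sketch of the 50-events-into-5-partitions split. The round-robin rule in `partition_for` is an assumption chosen to keep the example simple; real Kafka typically picks the partition from a hash of the message key.

```python
NUM_PARTITIONS = 5


def partition_for(event_number, num_partitions=NUM_PARTITIONS):
    # Round-robin assignment: an assumption made for this sketch.
    return event_number % num_partitions


# Each partition is its own small append-only queue.
partitions = [[] for _ in range(NUM_PARTITIONS)]

# One second of traffic at 50 events/sec.
for event_number in range(50):
    partitions[partition_for(event_number)].append(event_number)

per_partition = [len(p) for p in partitions]

print(per_partition)       # [10, 10, 10, 10, 10]
print(partitions[0][:3])   # [0, 5, 10]
```

Each partition keeps its own events in order (0, 5, 10, ... in partition 0), but there is no single global order across partitions anymore. That is the trade-off for being able to read in parallel.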
And that was Kafka in 5 minutes my friends..!!
Conclusion
Yes, there are lots of technical details beyond this, but the initial explanation could not be simpler..! Hope you liked my article; follow my newsletter to get notified of new articles..! Do like this post.
You can now also download my Eazy Develop app, where you can read mine as well as hand picked articles from various awesome authors and tech blogs from companies like Google, LinkedIn, Meta, etc.