Apache Kafka: A Deep Dive into Distributed Event Streaming

Introduction

In the era of big data, organizations generate massive amounts of data that need to be processed, stored, and analyzed in real time. Apache Kafka is a distributed event streaming platform designed for high-throughput, fault-tolerant, and scalable data streaming. It enables real-time data processing across various use cases, including event logging, messaging, data pipelines, and stream processing.

Kafka combines three key capabilities to implement end-to-end event streaming solutions with a single, battle-tested platform:

  1. Publishing (Writing) and Subscribing (Reading): Write and read streams of events, including continuous import/export of data from other systems.
  2. Durable and Reliable Storage: Stores streams of events for as long as needed.
  3. Stream Processing: Processes events as they occur or retrospectively.

Core Features of Kafka

  • High Throughput – Handles millions of messages per second.
  • Scalability – Easily scales horizontally by adding brokers.
  • Durability – Stores messages in a fault-tolerant way.
  • Real-Time Streaming – Low-latency data processing.
  • Distributed & Replicated – Ensures high availability.


Kafka Architecture

Kafka is a distributed system consisting of servers and clients that communicate via a high-performance TCP network protocol. It can be deployed on bare-metal hardware, virtual machines, and containers, both on premises and in the cloud.

Kafka Components

1. Servers (Brokers & Connect Nodes)

Kafka runs as a cluster of one or more servers, which can span multiple regions. Some of these servers form the storage layer (brokers), while others run Kafka Connect to continuously import and export data between Kafka and external systems (e.g., relational databases, other Kafka clusters).

2. Clients (Producers & Consumers)

Clients allow you to build distributed applications that read, write, and process streams of events in parallel, at scale, and in a fault-tolerant manner.

  • Producers: Publish (write) events to Kafka.
  • Consumers: Subscribe to (read and process) events from Kafka.

Kafka fully decouples producers from consumers: they are independent of and agnostic to each other, and producers never need to wait for consumers. This decoupling lets each side scale independently without degrading performance.

3. Events

An event is a record of the fact that "something happened" in the world or in a business process. Kafka stores data as events.

Example Event:

  • Event Key: "Alice"
  • Event Value: "Made a payment of $200 to Bob"
  • Event Timestamp: "Jun. 25, 2020, at 2:06 p.m."
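
As a rough sketch, this event could be published from the command line with the console producer's key/value options (parse.key and key.separator are standard console-producer properties; the payments topic name is made up for this illustration):

bin/kafka-console-producer.sh \
    --topic payments \
    --bootstrap-server localhost:9092 \
    --property parse.key=true \
    --property key.separator=:

# Then type each event as key:value on a single line, for example:
# Alice:Made a payment of $200 to Bob

Kafka stamps the timestamp automatically (producer or broker time, depending on the topic's message.timestamp.type setting).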

4. Topics

Events are organized and durably stored in topics. A topic is like a folder in a filesystem, with events as files inside it.

  • Multi-Producer & Multi-Subscriber: Topics can have multiple producers and consumers.
  • Data Retention: Kafka does not delete events after consumption; instead, users define retention policies (see the example below).
  • Efficient Storage: Kafka can store large volumes of data without performance degradation.
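
For example, the retention policy of an existing topic can be changed with the kafka-configs.sh tool; the seven-day value below is purely illustrative:

bin/kafka-configs.sh \
    --bootstrap-server localhost:9092 \
    --alter --entity-type topics --entity-name test-topic \
    --add-config retention.ms=604800000   # retain events for 7 days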

5. Partitions

Topics are partitioned, meaning a topic is split into multiple "buckets", distributed across different brokers.

  • Enables parallel processing – Clients can read/write data simultaneously.
  • Event Ordering – Kafka guarantees that events with the same key (e.g., customer ID) are always written to the same partition.
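
To observe this ordering, you can read a single partition and print each event's key (print.key is a standard console-consumer property; payments is the illustrative topic from above):

bin/kafka-console-consumer.sh \
    --topic payments \
    --partition 0 \
    --offset earliest \
    --property print.key=true \
    --bootstrap-server localhost:9092

All events sharing a key land in one partition, in the order they were written.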

6. Replication

Kafka ensures fault tolerance by replicating data across multiple brokers.

  • Replication Factor: Defines how many copies of each partition exist.
  • Leader-Follower Model: Each partition has a leader and follower replicas.
  • Failover Handling: If a broker fails, Kafka elects a new leader.
  • Recommended Setting: replication-factor=3 (3 copies of each partition for redundancy).
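
The leader, replica set, and in-sync replicas (ISR) of each partition can be inspected with the topics tool's --describe option:

bin/kafka-topics.sh \
    --describe --topic test-topic \
    --bootstrap-server localhost:9092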

7. Offsets

Kafka maintains an offset for each message in a partition. An offset is a unique identifier assigned to each message, indicating its position within the partition.

  • Consumers track offsets to ensure messages are processed correctly.
  • Committed Offsets: Consumers periodically commit their processed offsets to Kafka to avoid reprocessing messages in case of failure.
  • Auto-Offset Reset: If an offset is unavailable (e.g., due to log retention limits), consumers can be configured to reset the offset to the earliest or latest available message.

Offsets enable Kafka to provide at-least-once or exactly-once message processing guarantees, ensuring reliable data delivery.
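
Committed offsets can be inspected and rewound with the kafka-consumer-groups.sh tool (the group name my-group is hypothetical; offset resets only succeed while the group has no active consumers):

# Show committed offsets and consumer lag for a group
bin/kafka-consumer-groups.sh \
    --bootstrap-server localhost:9092 \
    --describe --group my-group

# Rewind the group to the earliest retained offsets
bin/kafka-consumer-groups.sh \
    --bootstrap-server localhost:9092 \
    --group my-group --topic test-topic \
    --reset-offsets --to-earliest --execute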


Creating a Kafka topic:

bin/kafka-topics.sh \
    --create --topic test-topic \
    --partitions 3 \
    --replication-factor 2 \
    --bootstrap-server localhost:9092

Creating a Producer:

bin/kafka-console-producer.sh \
    --topic test-topic \
    --bootstrap-server localhost:9092

Creating a Consumer:

bin/kafka-console-consumer.sh \
    --topic test-topic \
    --partition 0 \
    --offset earliest \
    --bootstrap-server localhost:9092

How Kafka Works

1. Producers send events to Kafka topics.

2. Events are distributed across partitions (based on keys or round-robin allocation).

3. Kafka brokers store and manage messages durably.

4. Consumers read data from topics using consumer groups.

5. Messages are processed in real time using Kafka Streams, Spark Streaming, or Flink.


Kafka Guarantees & Fault Tolerance

  • Message Durability – Events are stored persistently for a configurable retention period.
  • At-Least-Once Delivery – Messages are delivered at least once, avoiding data loss.
  • Exactly-Once Processing – Idempotent producers and transactions enable exactly-once semantics (e.g., in Kafka Streams).
  • Automatic Recovery – If a broker crashes, partition leadership fails over to in-sync replicas, so acknowledged data is not lost.
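
As a minimal sketch of tightening the producer side, client configs can be passed to the console producer via --producer-property; acks=all and enable.idempotence=true are standard producer settings (idempotence is already the default in recent Kafka versions):

bin/kafka-console-producer.sh \
    --topic test-topic \
    --bootstrap-server localhost:9092 \
    --producer-property acks=all \
    --producer-property enable.idempotence=true

# acks=all waits for the full in-sync replica set; idempotence de-duplicates broker-side retries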


Kafka Messaging Models

1. Publish-Subscribe Model

  • Multiple consumers can subscribe to the same topic.
  • Each consumer receives all messages from the topic.

2. Consumer Group Model (Load Balancing)

  • Each partition is assigned to one consumer in a consumer group.
  • Enables parallel processing across multiple consumers.
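
For example, running the command below in two terminals puts both consumers in the same group (my-group is an illustrative name), and Kafka splits the topic's partitions between them:

bin/kafka-console-consumer.sh \
    --topic test-topic \
    --group my-group \
    --bootstrap-server localhost:9092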


Kafka Topic Partitioning & Replication

Kafka splits a topic into partitions, allowing for parallel consumption and scalability.

  • Partitioning Strategy: Messages with the same key go to the same partition.
  • Replication ensures that data remains available even if a broker fails.
  • Leader Election: If the leader broker crashes, Kafka automatically assigns a new leader.

Choosing the Right Number of Partitions:

  • More partitions = better parallelism, but higher resource consumption.
  • Ideally, number of partitions >= number of consumers in a group (consumers beyond the partition count sit idle).
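
The partition count of an existing topic can be increased later (it can never be decreased):

bin/kafka-topics.sh \
    --alter --topic test-topic \
    --partitions 6 \
    --bootstrap-server localhost:9092

Note that adding partitions changes the key-to-partition mapping for new writes, so per-key ordering is not preserved across the change.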


ZooKeeper & KRaft Mode

ZooKeeper in Kafka

Apache Kafka traditionally relies on Apache ZooKeeper for metadata management, leader election, and broker coordination. ZooKeeper maintains the state of Kafka brokers and ensures fault tolerance by:

  • Tracking available brokers.
  • Managing partition leadership and failover.
  • Storing configurations and access control lists (ACLs).

KRaft (Kafka Raft) Mode

KRaft (Kafka Raft) is a ZooKeeper-less mode introduced to make Kafka more self-sufficient. KRaft replaces ZooKeeper by implementing its own Raft-based consensus algorithm.

Benefits of KRaft:

  • Simplified Deployment – No need for an external ZooKeeper cluster.
  • Faster Metadata Management – Improved scalability and resilience.
  • Better Fault Tolerance – One fewer external system to operate, secure, and fail over.
  • Higher Throughput – Eliminates the communication overhead between Kafka and ZooKeeper.

KRaft shipped in early access in Kafka 2.8 and has been production-ready since Kafka 3.3, fully replacing ZooKeeper.

Note: Apache Kafka 4.0 only supports KRaft mode. ZooKeeper mode has been removed.
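
A minimal single-node KRaft bring-up looks roughly like this (the config path below matches Kafka 3.x distributions, where KRaft properties live under config/kraft/; in 4.0 the default config/server.properties is already KRaft):

# Generate a cluster ID and format the metadata/log directories
KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
bin/kafka-storage.sh format -t "$KAFKA_CLUSTER_ID" -c config/kraft/server.properties

# Start the server - no ZooKeeper ensemble required
bin/kafka-server-start.sh config/kraft/server.properties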


Conclusion

Apache Kafka is a powerful, scalable, and fault-tolerant event streaming platform. It is widely used in big data architectures to support high-throughput messaging systems, log aggregation, real-time analytics, microservices, and event-driven architectures.

If you’re building real-time streaming pipelines, Apache Kafka is the go-to solution!
