Apache Kafka: A Deep Dive into Distributed Event Streaming
Introduction
In the era of big data, organizations generate massive amounts of data that need to be processed, stored, and analyzed in real time. Apache Kafka is a distributed event streaming platform designed to handle high-throughput, fault-tolerant, and scalable data streaming. It enables real-time data processing across various use cases, including event logging, messaging, data pipelines, and stream processing.
Kafka combines three key capabilities: publishing and subscribing to streams of events, storing those streams durably, and processing them as they occur or retrospectively. Together, these let you implement end-to-end event streaming solutions with a single, battle-tested platform:
Core Features of Kafka
• High Throughput – Handles millions of messages per second.
• Scalability – Easily scales horizontally by adding brokers.
• Durability – Stores messages in a fault-tolerant way.
• Real-Time Streaming – Low-latency data processing.
• Distributed & Replicated – Ensures high availability.
Kafka Architecture
Kafka is a distributed system consisting of servers and clients that communicate via a high-performance TCP network protocol. It can be deployed on bare-metal hardware, virtual machines, and containers, both on-premises and in the cloud.
Kafka Components
1. Servers (Brokers & Connect Nodes)
Kafka runs as a cluster of one or more servers, which can span multiple regions. Some of these servers form the storage layer (brokers), while others run Kafka Connect to continuously import and export data between Kafka and external systems (e.g., relational databases, other Kafka clusters).
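As a rough sketch, the standalone Connect worker that ships with the Kafka distribution can be started with the sample properties files in its config directory (the file names here assume a stock Kafka download; depending on the version, plugin.path in connect-standalone.properties may need to point at the bundled file-connector JAR). The file connectors simply stream lines from a local text file into a topic and back out to another file:
bin/connect-standalone.sh \
config/connect-standalone.properties \
config/connect-file-source.properties \
config/connect-file-sink.properties
Production deployments normally run Connect in distributed mode; standalone mode is just the quickest way to see continuous import and export in action.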
2. Clients (Producers & Consumers)
Clients allow you to build distributed applications that read, write, and process streams of events in parallel, at scale, and in a fault-tolerant manner.
Producers and consumers are fully decoupled and independent of each other: producers never need to wait for consumers, and each side can be scaled, upgraded, or restarted on its own. This decoupling is a key reason Kafka can scale without performance degradation.
3. Events
An event is a record of the fact that "something happened" in the world or in a business process. Kafka stores data as events.
Example Event:
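An event typically consists of a key, a value, a timestamp, and optional metadata headers. For instance:
Event key: "Alice"
Event value: "Made a payment of $200 to Bob"
Event timestamp: "Jun. 25, 2020 at 2:06 p.m."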
4. Topics
Events are organized and durably stored in topics. A topic is like a folder in a filesystem, with events as files inside it.
5. Partitions
Topics are partitioned, meaning a topic is split into multiple "buckets", distributed across different brokers.
6. Replication
Kafka ensures fault tolerance by replicating data across multiple brokers.
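For example, once a topic exists (such as the test-topic created in the CLI examples below), the same kafka-topics.sh tool can show how its partitions and replicas are spread across brokers, including the current leader and in-sync replica set (ISR) of each partition:
bin/kafka-topics.sh \
--describe --topic test-topic \
--bootstrap-server localhost:9092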
7. Offsets
Kafka maintains an offset for each message in a partition. An offset is a unique identifier assigned to each message, indicating its position within the partition.
Consumers track their progress through a partition by committing offsets. How and when offsets are committed is what allows Kafka to provide at-least-once or exactly-once processing guarantees and reliable data delivery.
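As an illustration, the kafka-consumer-groups.sh tool reports, per partition, a consumer group's last committed offset next to the log-end offset and the resulting lag (the group name my-group is just a placeholder; consumer groups are covered later in this article):
bin/kafka-consumer-groups.sh \
--describe --group my-group \
--bootstrap-server localhost:9092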
Creating a Kafka topic:
bin/kafka-topics.sh \
--create --topic test-topic \
--partitions 3 \
--replication-factor 2 \
--bootstrap-server localhost:9092
Creating a Producer:
bin/kafka-console-producer.sh \
--topic test-topic \
--bootstrap-server localhost:9092
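By default the console producer sends messages without keys. To see how keys influence partition assignment, it can also parse a key from each input line; parse.key and key.separator are standard console-producer properties, and the ':' separator here is just a choice for this example:
bin/kafka-console-producer.sh \
--topic test-topic \
--property parse.key=true \
--property key.separator=: \
--bootstrap-server localhost:9092
Typing user1:clicked-homepage then produces an event with key user1, and all events with that key go to the same partition.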
Creating a Consumer:
bin/kafka-console-consumer.sh \
--topic test-topic \
--partition 0 \
--offset earliest \
--bootstrap-server localhost:9092
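To read every partition of the topic from the start, instead of pinning a single partition and offset, the --from-beginning flag can be used:
bin/kafka-console-consumer.sh \
--topic test-topic \
--from-beginning \
--bootstrap-server localhost:9092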
How Kafka Works
1. Producers send events to Kafka topics.
2. Events are distributed across partitions (based on keys or round-robin allocation).
3. Kafka brokers store and manage messages durably.
4. Consumers read data from topics using consumer groups.
5. Messages are processed in real time using Kafka Streams, Spark Streaming, or Flink.
Kafka Guarantees & Fault Tolerance
• Message Durability – Events are stored persistently for a configurable retention period (see the retention example after this list).
• At-Least-Once Delivery – Messages are delivered at least once, so data is not lost (though duplicates are possible).
• Exactly-Once Processing – Kafka Streams and the idempotent/transactional producer support exactly-once semantics.
• Automatic Recovery – If a broker crashes, partition leadership fails over to in-sync replicas, so acknowledged data is not lost.
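The retention window behind the durability guarantee is a per-topic setting. As a sketch, it can be changed at runtime with kafka-configs.sh; the seven-day value (in milliseconds) is just an example:
bin/kafka-configs.sh \
--alter --entity-type topics --entity-name test-topic \
--add-config retention.ms=604800000 \
--bootstrap-server localhost:9092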
Kafka Messaging Models
1. Publish-Subscribe Model – Multiple consumer groups can subscribe to the same topic, and each group receives its own copy of every event (fan-out).
2. Consumer Group Model (Load Balancing) – Within a single consumer group, a topic's partitions are divided among the group's consumers, so each event is processed by only one consumer in the group (see the example after this list).
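A simple way to see the consumer group model in action is to start the console consumer with a group id. Running the command below in two terminals (the group name my-group is arbitrary) makes Kafka split the topic's partitions between the two instances, and stopping one triggers a rebalance:
bin/kafka-console-consumer.sh \
--topic test-topic \
--group my-group \
--bootstrap-server localhost:9092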
Kafka Topic Partitioning & Replication
Kafka splits a topic into partitions, allowing for parallel consumption and scalability.
Choosing the Right Number of Partitions:
• More partitions = better parallelism, but higher resource consumption.
• Ideally, the number of partitions should be at least the number of consumers in a group.
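Note that the partition count of an existing topic can be increased later (though never decreased); keep in mind that adding partitions changes which partition future keyed messages map to:
bin/kafka-topics.sh \
--alter --topic test-topic \
--partitions 6 \
--bootstrap-server localhost:9092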
ZooKeeper & KRaft Mode
ZooKeeper in Kafka
Apache Kafka traditionally relies on Apache ZooKeeper for metadata management, leader election, and broker coordination. ZooKeeper maintains the state of Kafka brokers and ensures fault tolerance by tracking which brokers are alive, electing the cluster controller, storing topic and partition metadata, and holding configuration such as ACLs and quotas.
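In ZooKeeper mode, a local ZooKeeper instance is typically started before the broker, using the helper scripts and sample properties files bundled with the Kafka distribution:
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties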
KRaft (Kafka Raft) Mode
KRaft (Kafka Raft) is a ZooKeeper-less mode introduced to make Kafka more self-sufficient. KRaft replaces ZooKeeper by implementing its own Raft-based consensus algorithm.
Benefits of KRaft:
• Simplified Deployment – No need for an external ZooKeeper cluster.
• Faster Metadata Management – Improved scalability and resilience.
• Better Fault Tolerance – No single point of failure for cluster metadata.
• Higher Throughput – Removes the communication overhead between Kafka and ZooKeeper.
KRaft has been production-ready since Kafka 3.3, and ZooKeeper mode was deprecated in later 3.x releases ahead of its removal.
Note: Apache Kafka 4.0 only supports KRaft mode. ZooKeeper mode has been removed.
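As a minimal sketch of running without ZooKeeper, a single-node KRaft broker can be brought up with the scripts and sample KRaft properties file that ship with a Kafka 3.x distribution; the storage directory must be formatted with a cluster ID before the first start:
KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
bin/kafka-storage.sh format -t $KAFKA_CLUSTER_ID -c config/kraft/server.properties
bin/kafka-server-start.sh config/kraft/server.properties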
Conclusion
Apache Kafka is a powerful, scalable, and fault-tolerant event streaming platform. It is widely used in big data architectures to support high-throughput messaging systems, log aggregation, real-time analytics, microservices, and event-driven architectures.
If you’re building real-time streaming pipelines, Apache Kafka is the go-to solution!