Apache Kafka Core Concepts
Said Naeem Shah
14+ Years | Backend Java Engineer | Solution Architect | Spring Boot | Microservices | Kafka | Cloud | Event & Domain-Driven | Kubernetes | SQL & NoSQL | Clean Code | Clean Architecture
Apache Kafka
Kafka is a distributed streaming platform: you can add any number of Kafka nodes to the cluster (horizontal scaling). Kafka retains messages for a configurable period of time (default is 7 days) after they have been received.
Kafka Components
Kafka cluster: A Kafka cluster refers to a group of one or more Kafka brokers (servers) working together to manage and store data streams.
Broker: A Kafka broker is a server that stores and manages data streams.
ZooKeeper: A naming registry used for Kafka cluster orchestration and management. When a broker or topic is added or removed, ZooKeeper notifies all nodes. It also helps track the health and status of the Kafka cluster. ZooKeeper is further used for leader election: each partition in a Kafka topic has one leader, and ZooKeeper helps elect leaders and notify other brokers about leader changes. In older Kafka versions, ZooKeeper was also used for consumer group coordination and offset storage; since Kafka 0.9 this is handled by a broker-side group coordinator.
KRaft (KIP-500): Apache Kafka Raft (KRaft) removes the dependency on ZooKeeper by moving metadata management into Kafka itself via a Raft-based quorum. KRaft mode is more scalable and faster than ZooKeeper mode, allowing Kafka clusters to handle more partitions and metadata traffic.
Topic: A topic is a category or feed name to which records (messages) are published/subscribed.
Partition: A Kafka topic is divided into partitions. Producers send messages to specific partitions, and consumers can subscribe to one or more partitions of a topic. Partitions allow multiple consumers to read from the same topic at the same time. If the producer does not specify a partition id, the default partitioner computes it from the key: partition id = hash(key) % number of partitions (e.g., with 3 partitions, a key hashing to 7 lands in partition 7 % 3 = 1). If the key is also not specified, messages are spread across partitions (round-robin in older clients; sticky partitioning since Kafka 2.4). If consumers are added to or removed from a consumer group, Kafka dynamically rebalances the partition assignments to ensure load distribution; consumer membership is detected via a heartbeat mechanism. Benefit: parallel processing.
Replica: Each partition can have one or more replicas, distributed across different brokers. One replica is designated the "leader" and the others are "followers." The leader handles all reads and writes for the partition, while the followers replicate the data from the leader. If a broker containing a leader goes down, one of the in-sync replicas is promoted to the new leader, ensuring continuous availability of the data. The number of replicas per partition is configurable; more replicas improve fault tolerance but increase storage and network requirements. Benefit: fault tolerance.
Consumer Group: A logical grouping of consumers that work together to consume and process messages from one or more topics. A message published to a topic is consumed by only one consumer within a consumer group.
Creating a topic (Spring Kafka):
import org.apache.kafka.clients.admin.NewTopic;
import org.springframework.context.annotation.Bean;
import org.springframework.kafka.config.TopicBuilder;

// Declares a topic with 3 partitions, each replicated to 3 brokers.
@Bean
public NewTopic myTopic() {
    return TopicBuilder.name("orders.creations.v1")
            .partitions(3)
            .replicas(3)
            .build();
}
Publishing messages:
kafkaTemplate.send(ordersCreationTopic, request.getKey(), request);
The main send overloads (from the KafkaTemplate Javadoc):
send(topicName, key, request);
send(topicName, partition, key, request);
Consuming messages:
@KafkaListener(topics = "orders.creations.v1", groupId = "${spring.kafka.consumer.group-id}")
public void listenToTopic(String message) {
    // process the received message
}
Kafka Core APIs
Producer API: Used to publish streams of records to topics.
Consumer API: Used to consume streams of records from topics.
Streams API: Used for real-time stream processing; it simplifies the development of stream processing logic (see the sketch after this list).
Connect API: Used for connecting Kafka with external systems. Kafka connectors enable the movement of data in and out of Kafka topics, connecting Kafka to external databases, storage systems, messaging systems and other data platforms.
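To make the Streams API concrete, here is a minimal sketch using the Kafka Streams client; the application id, broker address, output topic, and filtering predicate are illustrative assumptions, not part of the original article:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class OrderStreamApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-stream-app"); // assumed app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders.creations.v1");
        // Continuously filter records and forward the matches to another topic.
        orders.filter((key, value) -> value.contains("PRIORITY")) // hypothetical predicate
              .to("orders.priority.v1"); // hypothetical output topic
        new KafkaStreams(builder.build(), props).start();
    }
}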
Kafka Schema Registry
The Schema Registry manages the schemas for messages exchanged between producers and consumers. Messages are serialized before being sent to Kafka topics and must be deserialized by consumers to be processed. The Schema Registry ensures that the schema used by the producer and the schema expected by the consumer are compatible, and it versions schemas as they evolve.
Sample serialization configuration (JSON-based messages):
spring.kafka.producer.key-serializer=org.apache.kafka.common.serialization.StringSerializer
(This means the key will be serialized as a string)
spring.kafka.producer.value-serializer=org.springframework.kafka.support.serializer.JsonSerializer
(This means that the values of your Kafka messages are objects that will be serialized to JSON format using the JsonSerializer)
spring.kafka.consumer.value-deserializer=org.apache.kafka.common.serialization.StringDeserializer
(The consumer then reads the raw JSON payload as a string; use JsonDeserializer instead to map it back to an object)
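If a real Schema Registry (e.g., Confluent's) is in play, the client configuration looks roughly like this; the registry URL is an assumed local default, and the Avro serializers require the Confluent client dependency:

spring.kafka.producer.value-serializer=io.confluent.kafka.serializers.KafkaAvroSerializer
spring.kafka.consumer.value-deserializer=io.confluent.kafka.serializers.KafkaAvroDeserializer
spring.kafka.properties.schema.registry.url=http://localhost:8081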
Performance Optimization:
Improving Kafka performance involves multiple areas: hardware sizing, cluster configuration, topic partition and replica counts, producer and consumer tuning, and network and JVM configuration.
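For illustration, a few common tuning properties in the Spring style used above (the values are starting points to benchmark against your workload, not recommendations):

spring.kafka.producer.batch-size=32768
(larger batches amortize per-request overhead)
spring.kafka.producer.properties.linger.ms=10
(wait briefly so batches can fill before sending)
spring.kafka.producer.compression-type=lz4
(compress batches on the wire)
spring.kafka.consumer.fetch-min-size=1024
(let the broker accumulate data before answering a fetch)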
Kafka Security:
The following measures are used for security:
Authentication: Kafka supports various authentication mechanisms, such as mutual TLS (SSL) and SASL (Simple Authentication and Security Layer) with mechanisms like PLAIN, SCRAM, and GSSAPI/Kerberos.
Properties configuration:
spring.kafka.bootstrap-servers=[BROKER_URL_HERE]
spring.kafka.properties.sasl.mechanism=PLAIN
spring.kafka.properties.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="[USER_NAME_HERE]" password="[PASSWORD_HERE]";
spring.kafka.properties.security.protocol=SASL_SSL
(This property specifies the security protocol used between Kafka clients and brokers: SASL for authentication over an SSL-encrypted channel)
Authorization: Kafka uses ACLs (Access Control Lists) to validate whether a client has read or write access to a topic or other resource.
Encryption: All communication between brokers and clients can be encrypted with SSL/TLS. Message integrity can additionally be enforced at the application level, e.g., with an HMAC-SHA signature.
SSL Certificates: Uses SSL certificates for secure communication.
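A hedged sketch of the client-side SSL configuration in Spring Boot; the store paths and password placeholders are illustrative:

spring.kafka.security.protocol=SSL
spring.kafka.ssl.trust-store-location=file:/path/to/client.truststore.jks
spring.kafka.ssl.trust-store-password=[TRUSTSTORE_PASSWORD_HERE]
spring.kafka.ssl.key-store-location=file:/path/to/client.keystore.jks
spring.kafka.ssl.key-store-password=[KEYSTORE_PASSWORD_HERE]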
Kafka Monitoring:
Confluent Control Center: Confluent Control Center is a commercial web-based tool for managing and monitoring Kafka clusters that allows users to view the performance, health, and consumption of brokers, topics, partitions, producers, consumers, and connectors. It also allows users to set up SLAs, alerts, and anomaly detection for their Kafka cluster.
Kafka Metrics: You can use tools like JConsole, JVisualVM, or any other JMX-compatible monitoring tools to connect to the Kafka brokers and view important metrics.
ksqlDB: ksqlDB allows you to define streams and tables over the data in Kafka topics, and you can then run continuous queries on these streams and tables.
Questions:
How does Kafka maintain data consistency?
1. Kafka maintains message order within a partition: consumers read messages in the order they were appended, tracked via offsets. (Across partitions of the same topic, ordering is not guaranteed.)
2. When a producer publishes a record to a partition, the leader replicates the record to the follower replicas before the acknowledgment is sent back to the producer, keeping the followers in sync. Producers configure this via "acks" = 0, 1, or all; older clients defaulted to acks=1 (wait for the leader only), while since Kafka 3.0 the default is acks=all.
3. Kafka supports transactions, i.e., atomic writes across multiple partitions.
4. Each partition of a topic has a corresponding log on disk where data is appended. E.g., if the topic named "my-topic" has two partitions, Kafka creates two log directories: my-topic-0 and my-topic-1.
5. By design, a message published to a topic is consumed by only one consumer within a consumer group.
How does the publisher know that a message was successfully published?
The publisher waits for broker acknowledgments according to the configured acks value: acks=0 (no wait), acks=1 (leader only), or acks=all (all in-sync replicas). In application code, the result is surfaced through the send future, as sketched below.
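In Spring Kafka 3.x, send(...) returns a CompletableFuture that carries the broker acknowledgment; a minimal sketch (topic, key, payload, and the log field are illustrative):

kafkaTemplate.send("orders.creations.v1", key, payload)
    .whenComplete((result, ex) -> {
        if (ex == null) {
            // The broker acknowledged the write per the configured acks level.
            RecordMetadata meta = result.getRecordMetadata();
            log.info("Published to partition {} at offset {}", meta.partition(), meta.offset());
        } else {
            log.error("Publish failed after retries", ex); // handle or dead-letter
        }
    });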
How does Kafka know that a message was successfully consumed?
By default, consumer auto-commit is true:
spring.kafka.consumer.enable-auto-commit=true
With auto-commit, the consumer's offsets are committed automatically in the background at a configurable interval (auto.commit.interval.ms), so a crash between poll and commit can lead to re-delivery. For stronger guarantees, commit offsets manually, as sketched below.
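A hedged sketch of manual offset commits with Spring Kafka (assumes spring.kafka.consumer.enable-auto-commit=false and AckMode.MANUAL on the listener container factory; the topic name is illustrative):

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.support.Acknowledgment;

@KafkaListener(topics = "orders.creations.v1", groupId = "${spring.kafka.consumer.group-id}")
public void listen(String message, Acknowledgment ack) {
    // process the message ...
    ack.acknowledge(); // commit the offset only after successful processing
}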
What are the advantages of consumer groups?
Parallel Processing: Consumer groups enable parallel processing of messages. Each consumer within a group can process messages independently.
Scalability: Consumer groups support horizontal scalability. As the load increases, you can add more consumers to a consumer group to distribute the processing workload.
Fault Tolerance: If one consumer fails or goes down, other consumers in the same group can continue processing messages.
What are the message delivery semantics in Kafka?
Kafka offers three different semantics for message processing: "At Most Once", "At Least Once" and "Exactly Once".
Exactly once: This is the preferred behavior: each message is delivered once and only once. Messages are never lost or read twice even if some part of the system fails. Achieving "Exactly Once" semantics is challenging in distributed systems due to the potential for failures, retries, and reprocessing.
At most once: Messages are delivered once, and if there is a system failure, messages may be lost and are not redelivered.
At least once: This means messages are delivered one or more times. If there is a system failure, messages are never lost, but they may be delivered more than once.
How to achieve “exactly once” semantics?
Idempotent Producers: Producers are configured to be idempotent. If a producer sends a message multiple times due to retries or failures, the message will be written to the topic only once. When you enable idempotence for a Kafka producer, the producer assigns a sequence number to each message it sends and includes it in the message metadata. This sequence number allows the broker to detect and eliminate duplicate messages during the write process.
enable.idempotence=true
Exactly-once semantics can also be pursued at the application level on both the publisher and consumer side, e.g., with Kafka transactions and read_committed consumers, as sketched below.
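A hedged sketch of a transactional producer using the plain Java client; the broker address, transactional id, topic, and payload are illustrative:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ExactlyOnceProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "orders-tx-1"); // assumed transactional id

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();
        try {
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("orders.creations.v1", "key-1", "payload"));
            producer.commitTransaction(); // read_committed consumers see the write atomically
        } catch (Exception e) {
            producer.abortTransaction(); // rolls back everything sent in this transaction
        } finally {
            producer.close();
        }
    }
}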