Apache Kafka Core Concepts
Said Naeem Shah
14+ Years | Backend Java Engineer | Solution Architect | Spring Boot | Microservices | Kafka | Cloud | Event & Domain-Driven | Kubernetes | SQL & NoSQL | Clean Code | Clean Architecture
Apache Kafka
Kafka is a distributed streaming platform: you can add any number of Kafka nodes to the cluster (horizontal scaling). Kafka retains messages for a configurable period of time (default is 7 days) after they have been received.
Kafka Components
Kafka cluster: A Kafka cluster refers to a group of one or more Kafka brokers (servers) working together to manage and store data streams.
Broker: A Kafka broker is a server that stores and manages data streams.
ZooKeeper: A naming registry used for Kafka cluster orchestration and management. When a broker or topic is added or removed, ZooKeeper notifies all nodes. It also helps track the health and status of the Kafka cluster. ZooKeeper is further used for leader election: each partition in a Kafka topic has one leader, and ZooKeeper helps elect leaders and notify other brokers about leader changes. In older Kafka versions, ZooKeeper was also used for consumer group coordination and offset storage; since Kafka 0.9 this is handled by a broker-side group coordinator.
KRaft (KIP-500): Apache Kafka Raft (KRaft) removes the dependency on ZooKeeper by moving metadata management into Kafka itself via a Raft-based quorum. KRaft mode is more scalable and faster than ZooKeeper mode, allowing Kafka clusters to handle more partitions and metadata traffic.
Topic: A topic is a category or feed name to which records (messages) are published/subscribed.
Partition: A Kafka topic is divided into partitions. Producers send messages to specific partitions, and consumers can subscribe to one or more partitions of a topic. Partitions allow multiple consumers to read from the same topic at the same time. If the producer does not specify a partition id, the default partitioner computes it from the key: partition id = hash(key) % number of partitions (e.g., with 3 partitions, a key hashing to 7 lands in partition 7 % 3 = 1). If the key is also not specified, messages are spread across partitions (round-robin in older clients; sticky partitioning since Kafka 2.4). If consumers are added to or removed from a consumer group, Kafka dynamically rebalances the partition assignments to ensure load distribution; consumer membership is detected via a heartbeat mechanism. Benefit: parallel processing.
Replica: Each partition can have one or more replicas, distributed across different brokers. One replica is designated the "leader" and the others are "followers." The leader handles all reads and writes for the partition, while the followers replicate the data from the leader. If a broker containing a leader goes down, one of the in-sync replicas is promoted to the new leader, ensuring continuous availability of the data. The number of replicas per partition is configurable; more replicas improve fault tolerance but increase storage and network requirements. Benefit: fault tolerance.
Consumer Group: A logical grouping of consumers that work together to consume and process messages from one or more topics. A message published to a topic is consumed by only one consumer within a consumer group.
Creating a topic (Spring Kafka):
import org.apache.kafka.clients.admin.NewTopic;
import org.springframework.context.annotation.Bean;
import org.springframework.kafka.config.TopicBuilder;

// Declares a topic with 3 partitions, each replicated to 3 brokers.
@Bean
public NewTopic myTopic() {
    return TopicBuilder.name("orders.creations.v1")
            .partitions(3)
            .replicas(3)
            .build();
}
Publishing messages:
kafkaTemplate.send(ordersCreationTopic, request.getKey(), request);
The main send overloads (from the KafkaTemplate Javadoc):
send(topicName, key, request);
send(topicName, partition, key, request);
Consuming messages:
@KafkaListener(topics = "orders.creations.v1", groupId = "${spring.kafka.consumer.group-id}")
public void listenToTopic(String message) {
    // process the received message
}
Kafka Core APIs
Producer API: Used to publish streams of records to topics.
Consumer API: Used to consume streams of records from topics.
Streams API: Used for real-time stream processing; it simplifies the development of stream processing logic (see the sketch after this list).
Connect API: Used for connecting Kafka with external systems. Kafka connectors enable the movement of data in and out of Kafka topics, connecting Kafka to external databases, storage systems, messaging systems and other data platforms.
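To make the Streams API concrete, here is a minimal sketch using the Kafka Streams client; the application id, broker address, output topic, and filtering predicate are illustrative assumptions, not part of the original article:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class OrderStreamApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-stream-app"); // assumed app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders.creations.v1");
        // Continuously filter records and forward the matches to another topic.
        orders.filter((key, value) -> value.contains("PRIORITY")) // hypothetical predicate
              .to("orders.priority.v1"); // hypothetical output topic
        new KafkaStreams(builder.build(), props).start();
    }
}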
Kafka Schema Registry
The Schema Registry manages the schemas for messages exchanged between producers and consumers. Messages are serialized before being sent to Kafka topics and must be deserialized by consumers to be processed. The Schema Registry ensures that the schema used by the producer and the schema expected by the consumer are compatible, and it versions schemas as they evolve.
Sample serialization configuration (JSON-based messages):
spring.kafka.producer.key-serializer=org.apache.kafka.common.serialization.StringSerializer
(This means the key will be serialized as a string)
spring.kafka.producer.value-serializer=org.springframework.kafka.support.serializer.JsonSerializer
(This means that the values of your Kafka messages are objects that will be serialized to JSON format using the JsonSerializer)
spring.kafka.consumer.value-deserializer=org.apache.kafka.common.serialization.StringDeserializer
(The consumer then reads the raw JSON payload as a string; use JsonDeserializer instead to map it back to an object)
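If a real Schema Registry (e.g., Confluent's) is in play, the client configuration looks roughly like this; the registry URL is an assumed local default, and the Avro serializers require the Confluent client dependency:

spring.kafka.producer.value-serializer=io.confluent.kafka.serializers.KafkaAvroSerializer
spring.kafka.consumer.value-deserializer=io.confluent.kafka.serializers.KafkaAvroDeserializer
spring.kafka.properties.schema.registry.url=http://localhost:8081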
Performance Optimization:
Improving Kafka performance involves multiple areas: hardware sizing, cluster configuration, topic partition and replica counts, producer and consumer tuning, and network and JVM configuration.
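For illustration, a few common tuning properties in the Spring style used above (the values are starting points to benchmark against your workload, not recommendations):

spring.kafka.producer.batch-size=32768
(larger batches amortize per-request overhead)
spring.kafka.producer.properties.linger.ms=10
(wait briefly so batches can fill before sending)
spring.kafka.producer.compression-type=lz4
(compress batches on the wire)
spring.kafka.consumer.fetch-min-size=1024
(let the broker accumulate data before answering a fetch)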
Kafka Security:
The following measures are used for security:
Authentication: Kafka supports various authentication mechanisms, such as mutual TLS (SSL) and SASL (Simple Authentication and Security Layer) with mechanisms like PLAIN, SCRAM, and GSSAPI/Kerberos.
Properties configuration:
spring.kafka.bootstrap-servers=[BROKER_URL_HERE]
spring.kafka.properties.sasl.mechanism=PLAIN
spring.kafka.properties.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="[USER_NAME_HERE]" password="[PASSWORD_HERE]";
spring.kafka.properties.security.protocol=SASL_SSL
(This property specifies the security protocol used between Kafka clients and brokers: SASL for authentication over an SSL-encrypted channel)
Authorization: Kafka uses ACLs (Access Control Lists) to validate whether a client has read or write access to a topic or other resource.
Encryption: All communication between brokers and clients can be encrypted with SSL/TLS. Message integrity can additionally be enforced at the application level, e.g., with an HMAC-SHA signature.
SSL Certificates: Uses SSL certificates for secure communication.
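A hedged sketch of the client-side SSL configuration in Spring Boot; the store paths and password placeholders are illustrative:

spring.kafka.security.protocol=SSL
spring.kafka.ssl.trust-store-location=file:/path/to/client.truststore.jks
spring.kafka.ssl.trust-store-password=[TRUSTSTORE_PASSWORD_HERE]
spring.kafka.ssl.key-store-location=file:/path/to/client.keystore.jks
spring.kafka.ssl.key-store-password=[KEYSTORE_PASSWORD_HERE]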
Kafka Monitoring:
Confluent Control Center: Confluent Control Center is a commercial web-based tool for managing and monitoring Kafka clusters that allows users to view the performance, health, and consumption of brokers, topics, partitions, producers, consumers, and connectors. It also allows users to set up SLAs, alerts, and anomaly detection for their Kafka cluster.
Kafka Metrics: You can use tools like JConsole, JVisualVM, or any other JMX-compatible monitoring tools to connect to the Kafka brokers and view important metrics.
ksqlDB: ksqlDB allows you to define streams and tables over the data in Kafka topics, and you can then run continuous queries on these streams and tables.
Questions:
How does Kafka maintain data consistency?
1. Kafka maintains message order within a partition: consumers read messages in the order they were appended, tracked via offsets. (Across partitions of the same topic, ordering is not guaranteed.)
2. When a producer publishes a record to a partition, the leader replicates the record to the follower replicas before the acknowledgment is sent back to the producer, keeping the followers in sync. Producers configure this via "acks" = 0, 1, or all; older clients defaulted to acks=1 (wait for the leader only), while since Kafka 3.0 the default is acks=all.
3. Kafka supports transactions, i.e., atomic writes across multiple partitions.
4. Each partition of a topic has a corresponding log on disk where data is appended. E.g., if the topic named "my-topic" has two partitions, Kafka creates two log directories: my-topic-0 and my-topic-1.
5. By design, a message published to a topic is consumed by only one consumer within a consumer group.
How does the publisher know that a message was successfully published?
The publisher waits for broker acknowledgments according to the configured acks value: acks=0 (no wait), acks=1 (leader only), or acks=all (all in-sync replicas). In application code, the result is surfaced through the send future, as sketched below.
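In Spring Kafka 3.x, send(...) returns a CompletableFuture that carries the broker acknowledgment; a minimal sketch (topic, key, payload, and the log field are illustrative):

kafkaTemplate.send("orders.creations.v1", key, payload)
    .whenComplete((result, ex) -> {
        if (ex == null) {
            // The broker acknowledged the write per the configured acks level.
            RecordMetadata meta = result.getRecordMetadata();
            log.info("Published to partition {} at offset {}", meta.partition(), meta.offset());
        } else {
            log.error("Publish failed after retries", ex); // handle or dead-letter
        }
    });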
How does Kafka know that a message was successfully consumed?
By default, consumer auto-commit is true:
spring.kafka.consumer.enable-auto-commit=true
With auto-commit, the consumer's offsets are committed automatically in the background at a configurable interval (auto.commit.interval.ms), so a crash between poll and commit can lead to re-delivery. For stronger guarantees, commit offsets manually, as sketched below.
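A hedged sketch of manual offset commits with Spring Kafka (assumes spring.kafka.consumer.enable-auto-commit=false and AckMode.MANUAL on the listener container factory; the topic name is illustrative):

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.support.Acknowledgment;

@KafkaListener(topics = "orders.creations.v1", groupId = "${spring.kafka.consumer.group-id}")
public void listen(String message, Acknowledgment ack) {
    // process the message ...
    ack.acknowledge(); // commit the offset only after successful processing
}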
What are the advantages of consumer groups?
Parallel Processing: Consumer groups enable parallel processing of messages. Each consumer within a group can process messages independently.
Scalability: Consumer groups support horizontal scalability. As the load increases, you can add more consumers to a consumer group to distribute the processing workload.
Fault Tolerance: If one consumer fails or goes down, other consumers in the same group can continue processing messages.
What are the message delivery semantics in Kafka?
Kafka offers three different semantics for message processing: "At Most Once", "At Least Once" and "Exactly Once".
Exactly once: This is the preferred behavior: each message is delivered once and only once. Messages are never lost or read twice even if some part of the system fails. Achieving "Exactly Once" semantics is challenging in distributed systems due to the potential for failures, retries, and reprocessing.
At most once: Messages are delivered once, and if there is a system failure, messages may be lost and are not redelivered.
At least once: This means messages are delivered one or more times. If there is a system failure, messages are never lost, but they may be delivered more than once.
How to achieve “exactly once” semantics?
Idempotent Producers: Producers are configured to be idempotent. If a producer sends a message multiple times due to retries or failures, the message will be written to the topic only once. When you enable idempotence for a Kafka producer, the producer assigns a sequence number to each message it sends and includes it in the message metadata. This sequence number allows the broker to detect and eliminate duplicate messages during the write process.
enable.idempotence=true
Exactly-once semantics can also be pursued at the application level on both the publisher and consumer side, e.g., with Kafka transactions and read_committed consumers, as sketched below.
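A hedged sketch of a transactional producer using the plain Java client; the broker address, transactional id, topic, and payload are illustrative:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ExactlyOnceProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "orders-tx-1"); // assumed transactional id

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();
        try {
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("orders.creations.v1", "key-1", "payload"));
            producer.commitTransaction(); // read_committed consumers see the write atomically
        } catch (Exception e) {
            producer.abortTransaction(); // rolls back everything sent in this transaction
        } finally {
            producer.close();
        }
    }
}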