All About Kafka Reliability

Kafka is extremely flexible in how it can be used. Its use cases range from "capturing user clicks" (where high availability matters most) to "bank transactions" (where high consistency matters most). Achieving reliability in Kafka is a joint responsibility of Kafka admins and application teams. In this blog, we will look at the different components of Kafka that affect the reliability of message delivery.

This blog assumes that readers have basic knowledge of Kafka brokers, producers and consumers.

There are four important components:


  1. Broker:

Unclean Leader Election : The scenario here is that a leader broker has crashed and the remaining brokers go through an election to find a new leader, but the in-sync replicas had fallen behind the leader. We now have two options when building a reliable system:

  1. Wait for the leader to come back online : This impacts availability, but data remains consistent because the leader has the latest messages that the other replicas will catch up from.
  2. Make an out-of-sync replica the leader : This gives high availability, but data becomes inconsistent. For example, assume the latest offset at the leader was 500 when it went down, and the replicas had synced only up to offset 400. When one of those replicas becomes leader, it starts storing new messages from offset 400 onwards. This loses 100 messages that the original leader received from producers and acknowledged successfully but could not replicate to the other replicas; those messages were therefore never visible to consumers.

The choice between these two options is controlled by the broker/topic configuration unclean.leader.election.enable; keeping it false favours consistency over availability. To further address message loss due to broker failure, we can set min.insync.replicas to 2. This configuration can be applied at both the broker and the topic level. With a value of 2, at least 2 replicas must be in sync before the producer gets a successful response (when combined with acks = all, discussed below). This prevents message loss due to broker failure to the extent possible.
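As an illustration, here is a minimal sketch using the Java AdminClient to create a topic with these settings. The broker address and the topic name "payments" are assumptions for the example, not part of the original setup.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class ReliableTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed broker address; replace with your cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical topic: 3 partitions, replication factor 3.
            NewTopic topic = new NewTopic("payments", 3, (short) 3);
            // Topic-level overrides: require 2 in-sync replicas and
            // keep unclean leader election disabled for consistency.
            topic.configs(Map.of(
                    "min.insync.replicas", "2",
                    "unclean.leader.election.enable", "false"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```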

But these configurations alone do not guarantee zero message loss unless producers and consumers are also built in a reliable manner. Let us check what role the producer plays if someone sets acks = 1.


2. Producers: 

Let's understand the different types of "send" acknowledgements :

  1. acks = 0 : The producer does not wait for any acknowledgement from the broker; the client considers the message sent as soon as it goes out over the network and cannot confirm whether the message was actually saved on a broker. Hence it is a highly unreliable option.
  2. acks = 1 : The leader broker sends a successful ack once the data is written to its partition, but this carries the risk of the data not being replicated to the other in-sync replicas if the leader crashes right after sending the ack. This is the same scenario we discussed above in the Broker section and tried to address with min.insync.replicas = 2. With acks = 1 the producer does not wait for replication to the other in-sync replicas to complete, so we still carry a risk of losing messages.
  3. acks = all : This is the safest option and provides the highest reliability when combined with the min in-sync replicas setting. But it adds latency, so one can use the asynchronous send option along with larger batch sizes to get the best out of Kafka producers while still achieving high reliability (see the producer sketch after this list).
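As an illustration, here is a minimal sketch of a producer configured along these lines. The broker address, topic name and tuning values are assumptions, and the idempotence flag is an extra safeguard not discussed above.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ReliableProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");                 // wait for all in-sync replicas
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);  // keep retrying retriable errors
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);    // avoid duplicates on retry (0.11+)
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);               // small wait to build larger batches
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);       // larger batches offset acks=all latency

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Asynchronous send: the callback fires once the broker acks (or fails) the record.
            producer.send(new ProducerRecord<>("payments", "order-42", "amount=100"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            System.err.println("Send failed: " + exception.getMessage());
                        }
                    });
            producer.flush();
        }
    }
}
```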

Retriable and Non-Retriable Errors :

Some producer errors are retriable, for example LEADER_NOT_AVAILABLE. Kafka producers should leverage retries as much as possible and keep retrying until messages are published successfully. But there are scenarios where you want to retry only a limited number of times and then handle the failure manually. We will discuss a strategy for this below.

In contrast to retriable errors, there are non-retriable errors like INVALID_CONFIG or MESSAGE_TOO_LARGE. These mean the producer has to handle the error in its code or fix the specific issue before sending the message to the broker again. Most of these issues should be caught in testing, but if such a scenario still arises, the application should have the ability to handle it on an ad hoc basis.

Now let's understand the strategy for handling non-retriable errors or exhausted-retries scenarios.

  • Save/log failed messages so action can be taken later (a callback sketch follows this list).
  • Build a replay mechanism which can replay all failed messages on an ad hoc basis after fixing the message-producing logic or configuration issues.
  • Ability to handle duplicates at the consumer level : The scenario here is that the producer does not get a response from the broker, even though the broker received the message and saved it successfully to the in-sync replicas. The producer retries after the wait configured by "retry.backoff.ms" and hence sends a duplicate message. This scenario therefore requires consumers to handle duplicate messages, which we will see in the next section.
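A minimal sketch of such a send callback is below. The persistence step is only a placeholder for whatever durable store (DB, file, HDFS or a retry topic) the application chooses; nothing here is a standard Kafka utility.

```java
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.errors.RetriableException;

// Hypothetical callback: on failure, distinguish "retries exhausted" from
// "non-retriable" and hand the record to whatever store the application
// uses for later replay.
public class ReplayAwareCallback implements Callback {
    private final ProducerRecord<String, String> record;

    public ReplayAwareCallback(ProducerRecord<String, String> record) {
        this.record = record;
    }

    @Override
    public void onCompletion(RecordMetadata metadata, Exception exception) {
        if (exception == null) {
            return; // broker acknowledged the record successfully
        }
        String reason = (exception instanceof RetriableException)
                ? "retries-exhausted"   // client gave up after its configured retries
                : "non-retriable";      // e.g. MESSAGE_TOO_LARGE: needs a code/config fix
        // Placeholder persistence: a real application would save the record
        // (topic, key, value, reason) somewhere durable for ad hoc replay.
        System.err.printf("Failed to publish to %s (%s): key=%s%n",
                record.topic(), reason, record.key());
    }
}
```

It can be attached per record, e.g. producer.send(record, new ReplayAwareCallback(record)).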


3. Consumers:

First of all, it is important to understand the difference between a committed message and a committed offset.

Committed Message : A message that Kafka considers committed once it has been successfully replicated to all in-sync replicas.

Committed Offset : The position up to which a consumer of a given consumer group has consumed messages from a particular partition and committed that offset back to Kafka.

Consumers work in three modes :

(a) Auto : Here we have four important properties which come into play.

  • group.id : Consumers use this to register with a consumer group.
  • auto.offset.reset : When a consumer is reading for the first time, or resuming without a valid committed offset, it needs to know where to start reading messages. We can use "earliest" to start from the earliest offset of the partition, which may lead to reading duplicate messages, or "latest" to start from the end of the partition and read only new messages, which risks missing messages written while the consumer was down and is therefore the less reliable choice.
  • enable.auto.commit : This commits offsets automatically as part of the poll loop. A poll() call typically pulls multiple messages from a partition, and with this property set to true the consumed offsets are saved automatically in the background.
  • auto.commit.interval.ms : This controls how often the automatic commit happens. The default value is 5 seconds.

Because the auto feature does not give the consumer control over when offsets are committed, auto mode is not considered reliable from a system standpoint. A sketch of a consumer using these properties follows.
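This is a minimal sketch of auto mode using the four properties above; the broker address, topic name and group id are assumptions for illustration.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class AutoCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "payments-consumers");      // registers with this group
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");       // start from the beginning if no offset
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");          // offsets committed automatically
        props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "5000");     // default: every 5 seconds
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("payments"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
                // No explicit commit: offsets are committed in the background,
                // which is why auto mode gives the application no control.
            }
        }
    }
}
```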

(b) Manual :

  • Always commit offsets only after the events are processed. It means your poll loop should run the processing logic (such as saving messages to a DB or doing some translation) before it calls commit().
  • Commit frequency : Calling commit() after every single message consumes a lot of resources on the broker side, as it increases bookkeeping activity. Hence commit() should be called once all the messages from a poll() call have been processed (see the manual-commit sketch after this list).
  • Rebalance : A rebalance can happen at any time, as new consumers are added or existing consumers drop out. Consumer code should gracefully commit() the offsets it has processed before its partitions move out of the group; therefore commit() should also be called in a finally block or from a rebalance listener.
  • Consumer retries : The consumer should poll again for messages that were not processed successfully in an earlier poll. If processing is taking time, store the messages in a buffer and use consumer.pause(). This lets poll() keep being called without fetching new records, so the consumer stays alive in the group.
  • Failed messages : If the consumer is not able to process certain messages, it can publish them to a separate topic. A separate consumer group can then handle retries from that retry topic. This is similar to the dead-letter queue technique used in JMS-based implementations.
  • Handling long processing time : In addition to the previous step, where we used consumer.pause() for retries, we should use separate workers (with parallel threads) to process the messages faster. The responsibility lies with the consumer application to write an efficient message-handling mechanism.
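Here is a minimal sketch of manual-commit mode under the same assumptions (local broker, "payments" topic): the batch is processed first, commitSync() runs once per poll loop, and a best-effort commit happens in the finally block before the consumer leaves the group.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "payments-consumers");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");           // take control of commits
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("payments"));
        try {
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);           // e.g. save to DB, translate, etc.
                }
                consumer.commitSync();         // commit once the whole batch is processed
            }
        } finally {
            try {
                consumer.commitSync();         // best-effort commit before leaving the group
            } finally {
                consumer.close();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // Placeholder for the application's processing logic.
        System.out.printf("processed offset=%d key=%s%n", record.offset(), record.key());
    }
}
```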

(c) Replay Utility :

It is an extension to manual mode, as it needs manual intervention to process failed messages on the consumer side. Above, I mentioned handling failed messages using a separate topic and consumer group. But what if the fix required to handle the failed messages takes longer than the retention period of the topic? In that case we need long-term persistence. One can use the topic name, partition number and offset to uniquely identify a message, persist it in an HDFS system, and replay it later on an ad hoc basis. Such a utility should be very specific to the consumer application.
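Below is a minimal sketch of the replay side for the case where the message is still within the topic's retention: the utility assigns the partition directly and seeks to the persisted offset. Beyond retention, the copy persisted in HDFS would have to be used instead; this method is an illustrative assumption, not a standard Kafka utility.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.Collections;

public class ReplayUtility {
    /**
     * Re-reads a single record identified by (topic, partition, offset),
     * assuming it is still within the topic's retention window.
     */
    public static ConsumerRecord<String, String> replay(
            KafkaConsumer<String, String> consumer, String topic, int partition, long offset) {
        TopicPartition tp = new TopicPartition(topic, partition);
        consumer.assign(Collections.singletonList(tp)); // bypass group subscription
        consumer.seek(tp, offset);                      // jump to the persisted offset
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
        for (ConsumerRecord<String, String> record : records) {
            if (record.offset() == offset) {
                return record;                          // hand back to the application's handler
            }
        }
        return null;                                    // not found (e.g. already aged out)
    }
}
```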


4. Message Delivery:

In a previous section I mentioned the delivery of a duplicate message by the producer when it does not get a response from the broker due to a network issue, even though the broker received the message and saved it successfully. Hence we call this message delivery "at-least-once". To handle duplicate messages on the consumer side, the consumer should be made idempotent. This is done by using a unique key for each message (possibly derived from a domain key) and ignoring messages whose key has already been processed. Kafka therefore supports "at-least-once" delivery by default.
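A minimal sketch of an idempotent handler, assuming each message carries a unique (domain) key; the in-memory set stands in for a durable store such as a DB table with a unique constraint.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class IdempotentHandler {
    // In-memory stand-in for a durable store keyed by the message's unique/domain key.
    private final Set<String> processedKeys = ConcurrentHashMap.newKeySet();

    public void handle(ConsumerRecord<String, String> record) {
        // Skip records we have already processed; this absorbs producer retries.
        if (!processedKeys.add(record.key())) {
            return; // duplicate delivery, already handled
        }
        // Actual processing (e.g. apply the bank transaction exactly once).
        System.out.printf("processing key=%s value=%s%n", record.key(), record.value());
    }
}
```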

Kafka introduced a transactional producer capability in version 0.11.0.0, which can be leveraged to achieve "exactly-once" delivery of messages.
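A minimal sketch of the transactional producer API follows; the transactional.id and topic name are assumptions, and downstream consumers would typically set isolation.level=read_committed to see only committed records.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class TransactionalProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "payments-producer-1"); // hypothetical id

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();              // register the transactional id with the broker
            try {
                producer.beginTransaction();
                producer.send(new ProducerRecord<>("payments", "order-42", "amount=100"));
                producer.commitTransaction();         // records become visible to read_committed consumers
            } catch (Exception e) {
                // For fatal errors (e.g. ProducerFencedException) the producer should be closed instead.
                producer.abortTransaction();          // none of the records in the transaction are exposed
                throw new RuntimeException("transaction failed", e);
            }
        }
    }
}
```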

And the last one is "at-most-once", which can be achieved by disabling retries at the producer end and committing offsets before processing the batch of messages.


By using the combination of broker, producer, consumer and message-delivery settings, one can choose the degree of reliability the use case demands and build it effectively with Kafka.
