Practical Tips for Kafka Partitioning, Compression, Retention, and Swimlanes
Practical tips for Kafka, a widely used platform.

Kafka is widely used in distributed environments, so it is important to understand a few of its configuration choices in detail.

1) How many partitions should a Kafka topic have?

In Kafka, the partition is the unit of parallelism. When creating a topic, the first question that comes to mind is how many partitions it should have. Confluent suggests a formula for this:

max(t/p, t/c) partitions, where:

t = target throughput
p = measured throughput on a single partition for production
c = measured throughput on a single partition for consumption
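
For example, with purely illustrative numbers: if the target throughput t is 200 MB/s, a single partition can be produced to at 20 MB/s, and consumed from at 50 MB/s, the formula gives max(200/20, 200/50) = max(10, 4) = 10 partitions.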

In practice, though, it is hard to pin down the target throughput when you think about real scale, and the formula is difficult to apply because many factors affect throughput: synchronous replication gives you less throughput, whether compression is used (and which codec) matters too, and so on. Generally, I take the basic things below into consideration and then run a load test:

  • message size - smaller messages are the harder problem for a messaging system, because they magnify the per-record bookkeeping overhead; if the message/record size is under ~100 bytes, that overhead becomes significant.
  • compression - we will talk more about this
  • replication - mostly asynchronous, but depends on the requirements
  • broker size/number of brokers (or replication factor)
  • retention period
  • rate of consumption

I find out the replication factor and broker sizing used in production, and then start load testing in a dev environment with the same configuration. Generally I start with 8 partitions and observe: am I getting updates in real time, and can the system sustain the load? If not, I increase the number of partitions and apply other optimisations. But we should not have too many partitions, because more partitions require more open file handles.

Each partition maps to a directory in the broker's file system, so the more partitions there are, the higher the open file handle limit needs to be set in the underlying operating system. If you are using a managed service in the cloud, it may not be possible to change this limit.
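
As a starting point for such a load test, here is a minimal topic-creation sketch using the Java AdminClient. The topic name, broker address, and replication factor are assumptions for illustration, not a recommendation:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;
    import java.util.Collections;
    import java.util.Properties;

    public class CreateTopicSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
            try (AdminClient admin = AdminClient.create(props)) {
                // Hypothetical topic name; 8 partitions as a starting point, replication factor 3.
                NewTopic topic = new NewTopic("orders", 8, (short) 3);
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }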

2) Compression: If you have high message volume, large messages, and limited network bandwidth, go for compression so that network bandwidth is used more efficiently. In most cases compression works well, but it does come with CPU overhead.

  1. In general, lz4 is recommended for performance; it provides a low to moderate compression ratio.
  2. gzip is usually not recommended due to its high CPU overhead.
  3. If you’re looking for a compression ratio similar to gzip but with less CPU overhead, give zstd a try.
  4. Remember that encrypted data should not be compressed; encrypted bytes look random, so the result generally doesn’t compress well.
  5. Messages are batched and compressed on the producer side. When setting batch.size, keep in mind that a small batch size saves memory and decreases latency, while a large batch size increases throughput but uses more memory. linger.ms is the number of milliseconds a producer is willing to wait before sending a batch out. Good starting values are 16 KB to 64 KB for batch.size and 5 ms to 20 ms for linger.ms (see the sketch after this list).
  6. Set the topic's compression.type=producer to avoid unnecessary decompression and recompression on the broker side.
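
For reference, here is a minimal producer configuration sketch along those lines. The broker address, topic name, and the specific values are assumptions; tune them based on your own load tests:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;
    import java.util.Properties;

    public class CompressedProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker address
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");   // or "zstd" for a higher ratio
            props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32768);         // 32 KB, within the 16-64 KB range
            props.put(ProducerConfig.LINGER_MS_CONFIG, 10);             // wait up to 10 ms to fill a batch
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("orders", "key", "value")); // hypothetical topic
            }
        }
    }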

3) Retention period: The retention period determines how long Kafka retains messages in a topic before they become eligible for deletion. Don't forget to set an appropriate value for retention.ms at the topic level. I have seen that when this value is not set correctly, the Kafka brokers' disks fill up and brokers go down, throwing the entire cluster into disarray and causing production escalations.
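
A minimal sketch for setting retention.ms on an existing topic with the Java AdminClient might look like this; the topic name and the 3-day value are assumptions for illustration:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;
    import java.util.Collection;
    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;

    public class SetRetentionSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
            try (AdminClient admin = AdminClient.create(props)) {
                // Hypothetical topic name; 259200000 ms = 3 days, purely as an example value.
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
                AlterConfigOp setRetention = new AlterConfigOp(
                        new ConfigEntry("retention.ms", "259200000"), AlterConfigOp.OpType.SET);
                Map<ConfigResource, Collection<AlterConfigOp>> updates =
                        Map.of(topic, Collections.singletonList(setRetention));
                admin.incrementalAlterConfigs(updates).all().get();
            }
        }
    }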

4) Kafka swimlanes: This is not a term used by the Kafka open source community; HubSpot introduced it to handle imbalanced traffic.

In HubSpot's case, multiple customers were using the same topic, so a sudden burst of traffic from one particular customer would build up lag and introduce delays for everyone. The fundamental problem is that all traffic, for all customers, is produced to the same queue.

If a sudden burst of traffic comes in faster than the real-time swimlane can accommodate, the excess traffic is sent to an overflow swimlane. The decision about when to route traffic to the overflow lane is made at the source, based on a rate limiter.
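
To make the idea concrete, here is a minimal sketch of such a router. This is not HubSpot's actual implementation: the topic names, the per-customer rate, and the simple token-bucket rate limiter are all assumptions for illustration.

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class SwimlaneRouter {
        // Hypothetical topic names and per-customer budget.
        private static final String REALTIME_TOPIC = "events-realtime";
        private static final String OVERFLOW_TOPIC = "events-overflow";
        private static final double PERMITS_PER_SECOND = 100.0;

        private final KafkaProducer<String, String> producer;
        private final Map<String, TokenBucket> buckets = new ConcurrentHashMap<>();

        public SwimlaneRouter(KafkaProducer<String, String> producer) {
            this.producer = producer;
        }

        public void send(String customerId, String payload) {
            TokenBucket bucket = buckets.computeIfAbsent(customerId,
                    id -> new TokenBucket(PERMITS_PER_SECOND));
            // Within budget -> real-time swimlane; burst above budget -> overflow swimlane.
            String topic = bucket.tryAcquire() ? REALTIME_TOPIC : OVERFLOW_TOPIC;
            producer.send(new ProducerRecord<>(topic, customerId, payload));
        }

        // Minimal token bucket: refills continuously, capped at one second's worth of permits.
        private static final class TokenBucket {
            private final double ratePerSec;
            private double tokens;
            private long lastRefillNanos = System.nanoTime();

            TokenBucket(double ratePerSec) {
                this.ratePerSec = ratePerSec;
                this.tokens = ratePerSec;
            }

            synchronized boolean tryAcquire() {
                long now = System.nanoTime();
                tokens = Math.min(ratePerSec, tokens + (now - lastRefillNanos) / 1e9 * ratePerSec);
                lastRefillNanos = now;
                if (tokens >= 1.0) {
                    tokens -= 1.0;
                    return true;
                }
                return false;
            }
        }
    }

A consumer for the overflow topic can then drain it at its own pace, so a burst from one customer delays only that customer's excess traffic rather than the real-time lane shared by everyone.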







