Practical Tips for Kafka Partitioning, Compression, Retention, and Swimlanes
Practical tips for Kafka, a widely used platform.

Kafka is widely used in distributed environments, so it is important to understand a few of its configuration choices in detail.

1) How many partitions should a Kafka topic have?

In Kafka, the partition is the unit of parallelism. When creating a topic, the first question that comes to mind is how many partitions it should have. Confluent suggests a formula for this:

max(t/p, t/c) partitions, where:

t = target throughput
p = measured throughput on a single partition for production
c = measured throughput on a single partition for consumption
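
For example, with purely illustrative numbers: if the target throughput t is 200 MB/s, a single partition can be produced to at 20 MB/s, and consumed from at 50 MB/s, the formula gives max(200/20, 200/50) = max(10, 4) = 10 partitions.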

In practice, though, it is hard to pin down the target throughput when you think about real scale, and the formula is difficult to apply because many factors affect throughput: synchronous replication gives you less throughput, whether compression is used (and which codec) matters too, and so on. Generally, I take the basic things below into consideration and then run a load test:

  • message size - smaller messages are the harder problem for a messaging system, because they magnify the per-record bookkeeping overhead; if the message/record size is under ~100 bytes, that overhead becomes significant.
  • compression - we will talk more about this
  • replication - mostly asynchronous, but depends on the requirements
  • broker size/number of brokers (or replication factor)
  • retention period
  • rate of consumption

I find out the replication factor and broker sizing used in production, and then start load testing in a dev environment with the same configuration. Generally I start with 8 partitions and observe: am I getting updates in real time, and can the system sustain the load? If not, I increase the number of partitions and apply other optimisations. But we should not have too many partitions, because more partitions require more open file handles.

Each partition maps to a directory in the broker's file system, so the more partitions there are, the higher the open file handle limit needs to be set in the underlying operating system. If you are using a managed service in the cloud, it may not be possible to change this limit.
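
As a starting point for such a load test, here is a minimal topic-creation sketch using the Java AdminClient. The topic name, broker address, and replication factor are assumptions for illustration, not a recommendation:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;
    import java.util.Collections;
    import java.util.Properties;

    public class CreateTopicSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
            try (AdminClient admin = AdminClient.create(props)) {
                // Hypothetical topic name; 8 partitions as a starting point, replication factor 3.
                NewTopic topic = new NewTopic("orders", 8, (short) 3);
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }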

2) Compression: If you have high message volume, large messages, and limited network bandwidth, go for compression so that network bandwidth is used more efficiently. In most cases compression works well, but it does come with CPU overhead.

  1. In general, lz4 is recommended for performance; it provides a low to moderate compression ratio.
  2. gzip is usually not recommended due to its high CPU overhead.
  3. If you’re looking for a compression ratio similar to gzip but with less CPU overhead, give zstd a try.
  4. Remember that encrypted data should not be compressed; encrypted bytes look random, so the result generally doesn’t compress well.
  5. Messages are batched and compressed on the producer side. When setting batch.size, keep in mind that a small batch size saves memory and decreases latency, while a large batch size increases throughput but uses more memory. linger.ms is the number of milliseconds a producer is willing to wait before sending a batch out. Good starting values are 16 KB to 64 KB for batch.size and 5 ms to 20 ms for linger.ms (see the sketch after this list).
  6. Set the topic's compression.type=producer to avoid unnecessary decompression and recompression on the broker side.
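
For reference, here is a minimal producer configuration sketch along those lines. The broker address, topic name, and the specific values are assumptions; tune them based on your own load tests:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;
    import java.util.Properties;

    public class CompressedProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker address
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");   // or "zstd" for a higher ratio
            props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32768);         // 32 KB, within the 16-64 KB range
            props.put(ProducerConfig.LINGER_MS_CONFIG, 10);             // wait up to 10 ms to fill a batch
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("orders", "key", "value")); // hypothetical topic
            }
        }
    }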

3) Retention period: The retention period determines how long Kafka retains messages in a topic before they become eligible for deletion. Don't forget to set an appropriate value for retention.ms at the topic level. I have seen that when this value is not set correctly, the Kafka brokers' disks fill up and brokers go down, throwing the entire cluster into disarray and causing production escalations.
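
A minimal sketch for setting retention.ms on an existing topic with the Java AdminClient might look like this; the topic name and the 3-day value are assumptions for illustration:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;
    import java.util.Collection;
    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;

    public class SetRetentionSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
            try (AdminClient admin = AdminClient.create(props)) {
                // Hypothetical topic name; 259200000 ms = 3 days, purely as an example value.
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
                AlterConfigOp setRetention = new AlterConfigOp(
                        new ConfigEntry("retention.ms", "259200000"), AlterConfigOp.OpType.SET);
                Map<ConfigResource, Collection<AlterConfigOp>> updates =
                        Map.of(topic, Collections.singletonList(setRetention));
                admin.incrementalAlterConfigs(updates).all().get();
            }
        }
    }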

4) Kafka swimlanes: This is not a term used by the Kafka open source community; HubSpot introduced it to handle imbalanced traffic.

In HubSpot's case, multiple customers were using the same topic, so a sudden burst of traffic from one particular customer would build up lag and introduce delays for everyone. The fundamental problem is that all traffic, for all customers, is produced to the same queue.

If a sudden burst of traffic comes in faster than the real-time swimlane can accommodate, the excess traffic is sent to an overflow swimlane. The decision about when to route traffic to the overflow lane is made at the source, based on a rate limiter.
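
To make the idea concrete, here is a minimal sketch of such a router. This is not HubSpot's actual implementation: the topic names, the per-customer rate, and the simple token-bucket rate limiter are all assumptions for illustration.

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class SwimlaneRouter {
        // Hypothetical topic names and per-customer budget.
        private static final String REALTIME_TOPIC = "events-realtime";
        private static final String OVERFLOW_TOPIC = "events-overflow";
        private static final double PERMITS_PER_SECOND = 100.0;

        private final KafkaProducer<String, String> producer;
        private final Map<String, TokenBucket> buckets = new ConcurrentHashMap<>();

        public SwimlaneRouter(KafkaProducer<String, String> producer) {
            this.producer = producer;
        }

        public void send(String customerId, String payload) {
            TokenBucket bucket = buckets.computeIfAbsent(customerId,
                    id -> new TokenBucket(PERMITS_PER_SECOND));
            // Within budget -> real-time swimlane; burst above budget -> overflow swimlane.
            String topic = bucket.tryAcquire() ? REALTIME_TOPIC : OVERFLOW_TOPIC;
            producer.send(new ProducerRecord<>(topic, customerId, payload));
        }

        // Minimal token bucket: refills continuously, capped at one second's worth of permits.
        private static final class TokenBucket {
            private final double ratePerSec;
            private double tokens;
            private long lastRefillNanos = System.nanoTime();

            TokenBucket(double ratePerSec) {
                this.ratePerSec = ratePerSec;
                this.tokens = ratePerSec;
            }

            synchronized boolean tryAcquire() {
                long now = System.nanoTime();
                tokens = Math.min(ratePerSec, tokens + (now - lastRefillNanos) / 1e9 * ratePerSec);
                lastRefillNanos = now;
                if (tokens >= 1.0) {
                    tokens -= 1.0;
                    return true;
                }
                return false;
            }
        }
    }

A consumer for the overflow topic can then drain it at its own pace, so a burst from one customer delays only that customer's excess traffic rather than the real-time lane shared by everyone.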







