Tuning Kafka for High Performance and Scalability


At the scale of Digikala, the largest e-commerce platform in the Middle East with more than 2 million unique visits per day, tracking users' activities and analyzing them is essential for stakeholders to gain insight into what is going on in the business. It is increasingly important for businesses to have a clear view of all their data to stay competitive, which is where business intelligence (BI) tools come in. For this purpose, a reliable infrastructure is required so that no events are lost at high rates and volumes.

Events can be from the following list:

  • Pageviews
  • User source and medium
  • Session duration or time on page
  • Visitor location
  • Device type (i.e. desktop, tablet, or mobile)
  • etc.

Based on the documentation, the original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds. This means site activity is published to central topics with one topic per activity type. These feeds are available for subscription for a range of use cases including real-time processing, real-time monitoring, and loading into Hadoop or offline data warehousing systems for offline processing and reporting.


Basic Kafka Tuning

There are several different approaches to tuning the performance of Kafka. Generally, the tuning can be done in three different ways:

  1. Tuning the storage layer (durability/availability)
  2. Tuning the throughput/latency
  3. Changing the commodity hardware

It is important to note that these approaches can work against one another, so engineers should have a clear understanding of the key requirements before tuning.

We can take a look at these tuning dimensions using the following image from a Red Hat Developers article:

[Image: Kafka tuning overview, from the Red Hat Developers article]


Tuning the storage layer (durability/availability):

Apache Kafka offers several compression methods that can be used to reduce the amount of data that is transmitted and stored. These compression methods include gzip, snappy, lz4, and zstd. Each method has its own tradeoffs in terms of compression ratio, speed, and memory usage.

Several benchmark tests have been conducted to compare the performance of these compression methods in Kafka. In general, lz4 and snappy are faster and have lower CPU usage compared to gzip and zstd, which can be useful in scenarios where low latency is important. On the other hand, gzip and zstd generally have higher compression ratios, which can be beneficial in scenarios where storage space is at a premium.

In terms of overall efficiency, lz4 is often considered to be a good compromise between compression ratio and speed. It offers a good balance of compression ratio and CPU usage, and can be a good choice for most Kafka workloads. However, it's important to note that the optimal compression method can vary depending on the specific use case and requirements.

In summary, Apache Kafka offers several compression methods that can be used to reduce data transmission and storage requirements. While the best compression method can vary depending on the specific use case, lz4 is often considered to be a good balance between compression ratio and speed.
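
As a rough sketch (the topic name and broker address below are assumptions, not values from the article), Kafka's built-in producer performance tool can be used to compare compression methods on your own hardware before committing to one:

# Benchmark lz4; repeat with gzip, snappy, and zstd and compare the reported throughput and broker-side disk usage
bin/kafka-producer-perf-test.sh --topic page-views --num-records 1000000 --record-size 1000 --throughput -1 --producer-props bootstrap.servers=localhost:9092 compression.type=lz4

In the application itself, the same choice is made with a single producer property, e.g. compression.type=lz4.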


Tuning the throughput/latency:

Partition Number: Increasing the number of partitions in Kafka can have a significant impact on performance. Partitions are a fundamental unit of parallelism in Kafka, and they enable multiple consumers to read data from a topic concurrently.

In general, increasing the number of partitions can lead to higher throughput and better scalability, especially in scenarios where there are many consumers reading data from the same topic. This is because each partition can be read and processed independently, which can result in better load balancing and reduced contention.

However, there are also trade-offs to consider when increasing the number of partitions. For example, increasing the number of partitions can increase the complexity of managing Kafka clusters and may require additional resources such as increased memory and storage. In addition, having too many partitions can also result in reduced throughput and higher latencies due to the increased overhead of managing a larger number of partitions.

Therefore, it's important to carefully consider the number of partitions that are needed for a particular Kafka workload, taking into account factors such as the number of consumers, the amount of data being processed, and the available resources.

In general, a good rule of thumb is to provision at least as many partitions as there are consumers in the largest consumer group reading from the topic, since within a group each partition is assigned to exactly one consumer. This helps ensure that every consumer has work to do and that data can be processed in parallel without contention or performance issues.

In summary, increasing the number of partitions in Kafka can lead to higher throughput and better scalability, but it's important to carefully consider the tradeoffs and ensure that the partition count is appropriate for the specific workload and available resources.
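
As a minimal sketch (the topic name, partition count, and broker address are assumptions, not values from the article), the partition count is set when a topic is created and can only be increased later:

# Create a topic with 12 partitions and a replication factor of 3
bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic page-views --partitions 12 --replication-factor 3

# Increase the partition count later if more consumer parallelism is needed
# (it can never be decreased, and adding partitions changes the key-to-partition mapping for keyed messages)
bin/kafka-topics.sh --bootstrap-server localhost:9092 --alter --topic page-views --partitions 24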

Page cache vs. in-memory cache for the Kafka use case

In-memory: if a process maintains an in-process cache of the data, this data will likely be duplicated in OS pagecache, effectively storing everything twice.

Furthermore, we are building on top of the JVM, and anyone who has spent any time with Java memory usage knows two things: the memory overhead of objects is very high, often doubling the size of the data stored (or worse), and Java garbage collection becomes increasingly fiddly and slow as the in-heap data increases.

Page Cache

In essence, the page cache is part of the Virtual File System (VFS) whose main purpose, as you can guess, is improving the I/O latency of read and write operations.

In computing, a page cache, sometimes also called disk cache, is a transparent cache for the pages originating from a secondary storage device such as a hard disk drive (HDD) or a solid-state drive (SSD). The operating system keeps a page cache in otherwise unused portions of the main memory (RAM), resulting in quicker access to the contents of cached pages and overall performance improvements. A page cache is implemented in kernels with the paging memory management, and is mostly transparent to applications.

When compared to main memory, hard disk drive read/writes are slow and random accesses require expensive disk seeks; as a result, larger amounts of main memory bring performance improvements as more data can be cached in memory. Separate disk caching is provided on the hardware side, by dedicated RAM or NVRAM chips located either in the disk controller (in which case the cache is integrated into a hard disk drive and usually called disk buffer), or in a disk array controller. Such memory should not be confused with the page cache.

Pages in the page cache modified after being brought in are called dirty pages. Since non-dirty pages in the page cache have identical copies in secondary storage (e.g. hard disk drive or solid-state drive), discarding and reusing their space is much quicker than paging out application memory, and is often preferred over flushing the dirty pages into secondary storage and reusing their space. Executable binaries, such as applications and libraries, are also typically accessed through page cache and mapped to individual process spaces using virtual memory (this is done through the mmap system call on Unix-like operating systems).

The page cache also aids in writing to a disk. Pages in the main memory that have been modified during writing data to disk are marked as "dirty" and have to be flushed to disk before they can be freed. When a file write occurs, the cached page for the particular block is looked up. If it is already found in the page cache, the write is done to that page in the main memory. If it is not found in the page cache, then, when the write perfectly falls on page size boundaries, the page is not even read from disk, but allocated and immediately marked dirty. Otherwise, the page(s) are fetched from disk and requested modifications are done. A file that is created or opened in the page cache, but not written to, might result in a zero byte file at a later read.

However, not all cached pages can be written to as program code is often mapped as read-only or copy-on-write; in the latter case, modifications to code will only be visible to the process itself and will not be written to disk.

Given how the page cache works in the kernel, we need to change some kernel variables (located under /proc/sys/vm) and monitor the results.

vm.dirty_background_ratio is the percentage of system memory which, when dirty, causes the system to start writing data to the disk in the background.

vm.dirty_ratio is the percentage of system memory which, when dirty, causes the process doing writes to block and write out dirty pages to the disk.
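
As a hedged example (the values below are purely illustrative, not a recommendation), these kernel variables can be inspected and changed with sysctl and persisted via /etc/sysctl.conf:

# Inspect the current values
sysctl vm.dirty_background_ratio vm.dirty_ratio

# Change them at runtime (equivalent to writing into /proc/sys/vm/)
sysctl -w vm.dirty_background_ratio=10
sysctl -w vm.dirty_ratio=40

# To make the change survive a reboot, add the same key=value pairs to /etc/sysctl.conf and run: sysctl -p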

Changing the commodity hardware:

The type of storage used for Kafka can have a significant impact on its performance. In general, solid-state drives (SSDs) offer better performance compared to hard disk drives (HDDs) due to their faster read and write speeds. This can be especially important for workloads that involve a high volume of small messages or require low latency, such as real-time data processing.

One of the key benefits of using SSDs with Kafka is improved write performance. Because Kafka relies heavily on disk writes to store and persist data, faster write speeds can result in lower latencies and higher throughput. SSDs also tend to have lower seek times than HDDs, which can further improve performance in scenarios where many small messages are being written and read frequently.

That being said, HDDs can still be a viable option for Kafka in certain scenarios, such as when cost is a primary consideration or when data storage requirements are very large. However, it's important to note that using HDDs may result in higher latencies and lower throughput compared to SSDs.

In summary, while both hard drives and SSDs can be used with Kafka, using SSDs can lead to better performance in terms of lower latency and higher throughput, especially in workloads that involve a high volume of small messages or require low latency.
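
A quick, hedged way to compare the two storage options is to measure sequential write throughput directly on the Kafka data mount (the path below is an assumption; adjust it to your log.dirs setting):

# Write 1 GiB sequentially, bypassing the page cache, and note the throughput dd reports
dd if=/dev/zero of=/var/lib/kafka/ddtest bs=1M count=1024 oflag=direct
rm /var/lib/kafka/ddtest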

CAP and PACELC Theorem

The CAP theorem is a concept in distributed computing that describes the trade-offs between consistency, availability, and partition tolerance. According to the theorem, it is impossible for a distributed system to provide all three of these guarantees simultaneously in the event of a network partition. In other words, when a partition occurs and nodes cannot communicate with each other, the system must choose between consistency and availability.


[Image: CAP Theorem]

The PACELC theorem extends CAP: if a partition (P) occurs, a system must trade availability (A) against consistency (C); else (E), during normal operation, it must trade latency (L) against consistency (C). Both theorems are relevant to Kafka because it is a distributed system designed to handle large volumes of data across multiple nodes in a cluster. Since Kafka partitions data across multiple nodes, it must weigh consistency against availability in the face of network partitions, and latency against consistency during normal operation. Understanding the PACELC and CAP theorems is therefore essential for making informed decisions about the configuration and tuning of Kafka clusters so that they meet the needs of the use case.

[Image: PACELC Framework]
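
In Kafka these trade-offs surface as concrete configuration choices. As a minimal sketch (the topic name and values are assumptions): favoring consistency means acknowledging writes only once they are replicated and refusing to elect out-of-sync leaders, while favoring availability and latency relaxes those settings.

# Consistency-leaning topic settings
bin/kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name page-views --alter --add-config min.insync.replicas=2,unclean.leader.election.enable=false

# Consistency-leaning producer: wait for all in-sync replicas (acks=all)
# Availability/latency-leaning producer: acknowledge after the leader alone (acks=1)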


Experiment Results

Digikala has migrated its Kafka cluster storage from HDD to SSD. As a result of using the gzip compression method, the data size has been reduced by up to ten times.

We have set the dirty_ratio value to 40 and dirty_background_ratio to 80 using the following commands:

echo 40 > /proc/sys/vm/dirty_ratio

echo 80 > /proc/sys/vm/dirty_background_ratio
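
To observe the effect of these settings under load, dirty memory and writeback activity can be watched directly (a small sketch; any interval works):

# Watch dirty pages and writeback while the cluster is under load
watch -n 1 'grep -E "Dirty|Writeback" /proc/meminfo'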

Our cluster handles 30 times the usual data load without problems during promotions and Kafka spike tests.

Conclusion

In conclusion, Apache Kafka offers several approaches to improve performance and scalability, which are essential for businesses that need to track and analyze a high volume of events. Tuning the storage layer can be done by using compression methods, such as gzip, snappy, lz4, and zstd, which can reduce data transmission and storage requirements. On the other hand, tuning the throughput/latency can be achieved by increasing the number of partitions in Kafka, which can lead to higher throughput and better scalability, although it requires careful consideration of the tradeoffs. Finally, choosing between in-memory cache and page cache depends on the specific use case, but the Page Cache is generally recommended for improving IO latency of read and write operations, resulting in quicker access to the contents of cached pages and overall performance improvements. By following these recommendations, businesses can build a reliable infrastructure that can handle high rates and volumes of events without losing any data.

Acknowledgements

Thanks to Viacheslav Biriukov for his excellent post on the page cache on his blog.

I am grateful to Mojtaba Akbari, Arshia Akhavan, Dorsa Hasanlee, and Niyusha Baghayi at Digikala for assisting me in maintaining, monitoring, developing, and tuning the Kafka cluster.
