How is Kafka so efficient?
Tushar Goel
Lead Member Of Technical Staff @ Salesforce | Staff Software Engineer, Technical Lead
What is Apache Kafka?
Apache Kafka is a distributed streaming platform. It provides a high-throughput, highly distributed, fault-tolerant platform with low-latency delivery of messages. Here, we are going to focus on the low-latency delivery aspect.
Common causes of inefficiency
- Disk access pattern
- Too many small I/O operations
- Excessive byte copying
Disk access pattern:
Kafka relies on the filesystem for storage and caching. Writing data to random locations is very slow compared with writing in append mode, i.e., writing records one after another in a linear way. Disks are slow mainly because of seek operations: a seek moves the disk head to the particular sector where the data lives, and every random access pays that cost.
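The two access patterns can be sketched in a few lines of Python. The file paths, record size, and record count below are illustrative, not Kafka's actual layout; the point is only the shape of each pattern:

```python
import os
import random
import tempfile

def sequential_append(path, chunks):
    """Append each chunk at the tail: the write position moves
    linearly, so no per-write seek cost is paid."""
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)

def random_writes(path, chunks, file_size):
    """Write each chunk at a random offset: on a spinning disk,
    every write may pay a head seek, which is what makes random
    I/O slow."""
    with open(path, "wb") as f:
        f.truncate(file_size)
        for chunk in chunks:
            f.seek(random.randrange(file_size - len(chunk)))
            f.write(chunk)

chunks = [b"x" * 4096 for _ in range(256)]  # 1 MiB in 4 KiB records

seq_path = tempfile.mktemp()
sequential_append(seq_path, chunks)

rnd_path = tempfile.mktemp()
random_writes(rnd_path, chunks, 1 << 20)

print(os.path.getsize(seq_path))  # 1048576
```

Both functions move the same number of bytes; on rotating media the sequential version is typically orders of magnitude faster because the disk never has to reposition its head between writes.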
A modern operating system provides read-ahead and write-behind techniques that prefetch data in large block multiples and group small logical writes into large physical writes. Relying on this is much better than maintaining a cache inside a JVM application because:
1. After a restart, an in-process cache is empty and all data must be fetched again, while the OS page cache stays warm
2. The memory overhead of JVM objects is very high, often doubling the size of the stored data
3. JVM garbage collection gets slower as the amount of heap data increases
4. Data can be stored compactly (and compressed) on disk instead of as expanded objects on the heap
For the above reasons, using the filesystem and relying on the page cache is a better option than an in-memory cache. So instead of buffering data in memory and flushing it to the filesystem when space runs out, all data is immediately written to the filesystem. More details about this page-centric design are explained in this article.
Don't use trees:
Persistent storage used by most messaging systems maintains metadata in a BTree. A BTree guarantees O(log n) lookups, but those lookups translate into disk operations: they depend on seek time, and each disk can perform only one seek at a time, so parallelism is limited. As a result, performance degrades as the data grows.
A persistent queue can instead be maintained in a log-structured format, where writes are done one after another in append mode. This gives O(1) writes, and readers can read independently of write operations.
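A toy sketch of such a log-structured queue follows. The length-prefixed record format and file path are illustrative only; Kafka's real segment format is considerably richer:

```python
import tempfile

class AppendOnlyLog:
    """Toy log-structured queue: each record is length-prefixed and
    appended at the tail. Readers keep their own byte offset, so they
    never block writers (illustrative, not Kafka's on-disk format)."""

    def __init__(self, path):
        self.path = path
        open(path, "ab").close()  # create the file if it does not exist

    def append(self, record: bytes) -> int:
        # O(1): no index to rebalance, just write at the current tail.
        with open(self.path, "ab") as f:
            offset = f.tell()
            f.write(len(record).to_bytes(4, "big") + record)
            return offset

    def read(self, offset: int):
        # Readers address records by byte offset, independent of writes.
        with open(self.path, "rb") as f:
            f.seek(offset)
            size = int.from_bytes(f.read(4), "big")
            record = f.read(size)
            return record, offset + 4 + size

log = AppendOnlyLog(tempfile.mktemp())
off = log.append(b"hello")
log.append(b"world")

rec, nxt = log.read(off)
print(rec)               # b'hello'
print(log.read(nxt)[0])  # b'world'
```

Note how a reader's position is just an integer offset it carries itself, which is essentially how Kafka consumer offsets decouple reading from writing.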
Too many small I/O operations
Instead of sending one message at a time, Kafka lets network requests group messages together to reduce network overhead. The server appends this chunk of messages to its log in one go, and a consumer can then fetch a large chunk at a time. This improves performance multi-fold: batching lets the producer send larger packets, gives the broker sequential disk access, and lets the data flow to the consumer as a stream of messages.
To make this even more efficient, Kafka compresses whole batches (and not individual messages) using the LZ4, Snappy, or GZIP codecs. Compressing a batch exposes redundancy across messages and therefore yields better compression ratios.
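The batch-versus-individual effect is easy to demonstrate with Python's gzip module. The message payloads below are made up; any set of small, similar records behaves the same way:

```python
import gzip

# 100 small, similar records, like typical event messages.
messages = [f'{{"user_id": {i}, "event": "click"}}'.encode() for i in range(100)]

# Compressing each message alone: the codec sees too little data to
# find redundancy, and pays per-message header overhead every time.
individual = sum(len(gzip.compress(m)) for m in messages)

# Compressing the whole batch at once, as Kafka does per record batch:
# repeated field names and values compress away across messages.
batched = len(gzip.compress(b"".join(messages)))

print(individual, batched)  # batched is much smaller
```

The same reasoning applies to LZ4 and Snappy; gzip is used here only because it ships with the standard library.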
Excessive byte copying
One of the major inefficiencies of data processing systems is the serialization and deserialization of data during transfers. This can be made faster by using better binary data formats, such as Protocol Buffers or FlatBuffers, instead of JSON. But how can you avoid serialization/deserialization altogether? Kafka handles this in two ways:
- Use a standardized binary data format for producers, brokers, and consumers (so data can be passed without modification)
- Don't copy the data in the application ("zero-copy")
For the second point, we need to understand how a common data transfer from file to socket works:
1. The OS reads data from the disk into the page cache in kernel space
2. The application reads the data from kernel space into a user-space buffer
3. The application writes the data back into kernel space, into a socket buffer
4. The OS copies the data from the socket buffer to the NIC buffer, where it is sent over the network
However, if we use the same standardized format end to end, so the data needs no modification in the application, then steps 2 and 3 (copying the data between kernel space and user space) become unnecessary.
If we keep data in the same format in which it will be sent over the network, we can copy it directly from the page cache to the NIC buffer. This is done through the OS sendfile system call. More details on the zero-copy approach can be found in this article.
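On Linux, Python exposes this system call as os.sendfile, so the kernel-side copy can be sketched directly. The file contents, and the local socket pair standing in for a consumer connection, are illustrative:

```python
import os
import socket
import tempfile

# Create a file to serve (its contents stand in for a log segment).
path = tempfile.mktemp()
with open(path, "wb") as f:
    f.write(b"log-segment-bytes" * 1024)

# A local socket pair stands in for a broker-to-consumer connection.
server, client = socket.socketpair()

with open(path, "rb") as f:
    size = os.fstat(f.fileno()).st_size
    sent = 0
    while sent < size:
        # sendfile copies page cache -> socket inside the kernel:
        # the bytes never enter user space.
        sent += os.sendfile(server.fileno(), f.fileno(), sent, size - sent)
server.close()

# Drain the "consumer" side to confirm everything arrived.
received = b""
while True:
    chunk = client.recv(65536)
    if not chunk:
        break
    received += chunk
print(len(received))  # 17408
```

Compare this with the four-step path above: the user-space buffer and the extra copies around it are gone, and the application only orchestrates offsets.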