In recent years, most of us have used Kafka for use cases such as message brokering, activity tracking, and event sourcing. Have you ever been curious about why so many companies, even very large ones, choose Kafka?
This series explores the motivations behind the development of Apache Kafka and how its design meets them. In this first article, we focus on Kafka's approach to data storage, caching, and transfer.
According to Apache, Kafka is designed as a unified platform for handling all the real-time data feeds a large company might have. This goal implies the following requirements:
- High throughput to support high-volume event streams such as real-time log aggregation.
- Graceful handling of large data backlogs to support periodic data loads from offline systems.
- Low-latency delivery to handle more traditional messaging use cases.
- Fault-tolerance guarantees in the presence of machine failures.
Kafka stores messages in append-only logs whose contents are immutable. This approach improves throughput and deals gracefully with large data backlogs.
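To make the append-only idea concrete, here is a minimal sketch in Java (not Kafka's actual segment format): records are length-prefixed and only ever written at the current end of the file, so writes stay strictly sequential and earlier bytes are never modified. The class name, file name, and record layout are illustrative assumptions.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative append-only log: every record is written at the current end of the
// file, so all disk access is sequential and previously written bytes never change.
public class AppendOnlyLog {

    private final FileChannel channel;

    public AppendOnlyLog(Path file) throws IOException {
        this.channel = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        channel.position(channel.size()); // start writing at the current end of the log
    }

    // Appending is O(1): no seeks and no tree rebalancing, just a write at the tail.
    // The returned byte offset can serve as the record's position in the log.
    public long append(byte[] payload) throws IOException {
        long offset = channel.position();
        ByteBuffer record = ByteBuffer.allocate(4 + payload.length);
        record.putInt(payload.length); // length prefix
        record.put(payload);           // immutable record body
        record.flip();
        while (record.hasRemaining()) {
            channel.write(record);
        }
        return offset;
    }

    public void close() throws IOException {
        channel.close();
    }

    public static void main(String[] args) throws IOException {
        AppendOnlyLog log = new AppendOnlyLog(Path.of("demo.log"));
        long offset = log.append("hello, log".getBytes(StandardCharsets.UTF_8));
        System.out.println("record appended at offset " + offset);
        log.close();
    }
}
```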
- Sequential Disk Access: Modern disks are extremely efficient at sequential reads and writes (hundreds of MB/s). Kafka uses a log-structured approach instead of random seeks. For example, a 7200rpm SATA RAID-5 array achieves about 600 MB/sec for sequential writes but only about 100 KB/sec for random writes, a difference of more than 6000x.
- Pagecache and Memory: Kafka writes data to a persistent log immediately and relies on the operating system's page cache rather than maintaining large in-memory caches. This ensures that data persists (even after a restart), avoids slow rebuilds of an in-process cache, and leaves cache coherency to the OS.
- Message Retention: Kafka’s persistent log design allows it to retain messages for extended periods (e.g., a week) instead of deleting them as soon as they are consumed (a topic configuration sketch follows this list).
- Simple Data Structures: Kafka’s append-only log, where appends are O(1), avoids the seek-heavy access patterns of more complex structures such as B-trees, so performance stays stable even as data volumes grow.
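As a concrete illustration of retention, the hedged sketch below uses Kafka's AdminClient to create a topic whose data is kept for roughly one week via the retention.ms topic config. The topic name, partition count, replication factor, and broker address are assumptions for the example, not recommendations.

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

// Sketch: create a topic whose messages are retained for about one week,
// regardless of whether consumers have already read them.
public class RetentionExample {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        // "localhost:9092" is an assumed broker address for this sketch.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // "events", 3 partitions, and replication factor 1 are illustrative values.
            NewTopic topic = new NewTopic("events", 3, (short) 1)
                    // retention.ms: keep log segments for 7 days (in milliseconds)
                    .configs(Map.of("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)));
            admin.createTopics(Set.of(topic)).all().get();
        }
    }
}
```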
Kafka uses several techniques to speed up data transfer, such as batching, compression, a common binary message format, and, notably, zero-copy. Together these help Kafka achieve high throughput and low-latency delivery.
- Batching I/O: Grouping messages together into sets reduces the overhead of numerous small I/O operations and network roundtrips. It leads to larger network packets, larger sequential disk operations, and contiguous memory blocks (a producer configuration sketch follows this list).
- Minimizing Byte Copies: A standardized binary message format shared by producers, brokers, and consumers avoids unnecessary copying. Data chunks can be transferred without modification.
- Zero-Copy Optimization: Kafka transfers data directly from the OS page cache to a network socket, avoiding the extra copies through user-space and socket buffers, and the extra system calls, that a conventional read-then-send path would require (see the transfer sketch after this list).
- End-to-End Batch Compression: Kafka optimizes network bandwidth by supporting batch compression. Instead of compressing individual messages, Kafka compresses batches of messages together, improving compression ratios and reducing network load.
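The sketch below shows how producer-side batching and end-to-end batch compression are typically configured, using the standard batch.size, linger.ms, and compression.type producer settings. The topic name, broker address, and the specific tuning values are assumptions for illustration only.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Sketch: a producer tuned for batching and end-to-end batch compression.
public class BatchingProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Batching: accumulate up to 64 KB per partition, or wait up to 20 ms,
        // before sending, so many small records travel in one request.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);

        // Compression is applied to the whole batch, not to individual records,
        // which usually gives much better ratios on similar messages.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1_000; i++) {
                producer.send(new ProducerRecord<>("events", "key-" + i, "value-" + i));
            }
            producer.flush();
        }
    }
}
```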
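And here is a minimal sketch of the zero-copy idea itself: on the JVM it is exposed through FileChannel.transferTo(), which on Linux maps to sendfile() and lets the kernel move bytes from the page cache straight to a socket. The file name and destination address are placeholders; this illustrates the mechanism, not Kafka's internal code.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: zero-copy style transfer of a log segment to a socket using
// FileChannel.transferTo(). File and destination are illustrative assumptions.
public class ZeroCopyTransferExample {
    public static void main(String[] args) throws IOException {
        Path segment = Path.of("demo.log");
        try (FileChannel file = FileChannel.open(segment, StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9999))) {

            long position = 0;
            long remaining = file.size();
            // transferTo() asks the kernel to move bytes from the page cache to the
            // socket directly, without copying them through a user-space buffer.
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```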
Kafka's design prioritizes high throughput, persistence, low latency, and fault tolerance. Its log-centric architecture and sequential disk access let it make the most of modern disks' read, write, and caching behavior. Combined with data transfer techniques like zero-copy and batch compression, Kafka can handle enormous streams of events efficiently and reliably.
In the next article, we will explore Kafka's solutions for message brokers, distribution, replication, and fault tolerance.