Data Management Internals - KStreams and KTables
Kafka Streams is a great way of building stream processing applications. Built on top of the Kafka Producer and Consumer libraries, the Kafka Streams library leverages the capabilities natively offered by Kafka, like horizontal scalability, parallel data processing, fault tolerance and much more. In this article, I am going to cover how Kafka Streams works internally.
Since the Kafka Streams library is built on the Kafka Producer and Consumer libraries, the way it processes data isn't much different. A partition on the stream maps to a partition on the topic, and a record on the stream maps to a message on the topic. Kafka Streams essentially provides an abstraction over the native Kafka Producer and Consumer libraries that lets developers get more functionality with fewer lines of code. For example:
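A minimal sketch of such a pipeline; the topic names, key/value types, application id and broker address here are assumptions for illustration:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class DoublerApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "doubler-app");        // assumed application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker address

        StreamsBuilder builder = new StreamsBuilder();

        // Read each record from the input topic, double its value, and write it to the output topic.
        KStream<String, Long> input =
                builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.Long()));
        input.mapValues(value -> value * 2)
             .to("output-topic", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```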
The above snippet takes an incoming stream of records, doubles the value of each record and puts it onto the output topic. To accomplish the same thing with the Kafka Producer and Consumer APIs, we would have to write at least five times as many lines of code. The operation performed above is stateless, i.e. at no point are we actually storing the state of the records. All we are really doing is taking a record, transforming it and sending it forward. But Kafka Streams also provides us with something called state stores.
These state stores, provided by Kafka Streams, can be used to store and query data. They become really significant when we are dealing with stateful operations. The Kafka Streams DSL automatically creates and manages these state stores when a stateful operation like join() or aggregate() is called on a KStream, or when a KStream is windowed. The way these state stores work is really interesting.
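A sketch of such a stateful operation, assuming a stream of page-view counts keyed by page name; the topic and store names are placeholders:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;

public class PageViewCounts {
    // Builds a topology that counts page views per page. Because count() is a
    // stateful operation, the DSL creates and manages the "counts-store" state
    // store (and its backing changelog topic) for us.
    static Topology buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, Long> views =
                builder.stream("page-views", Consumed.with(Serdes.String(), Serdes.Long())); // assumed topic
        views.groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
             .count(Materialized.as("counts-store"));                                        // assumed store name
        return builder.build();
    }
}
```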
These state stores use RocksDB as a local, non-fault-tolerant cache and a Kafka changelog topic as a fault-tolerant backup. Data written to a state store is first held in memory, from where it is flushed to the local file system (a local RocksDB instance), which acts as the local cache; then, on every elapse of the commit interval, the data is pushed onto the Kafka state-store changelog topic. The changelog topic is hence the source of truth in case the application crashes and then restarts: its data is used to repopulate the local RocksDB instance and hence the state store.
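Both the commit interval and the directory holding the local RocksDB instances are configurable; a small sketch, with assumed values:

```java
import org.apache.kafka.streams.StreamsConfig;
import java.util.Properties;

public class StateStoreConfig {
    static Properties stateProps() {
        Properties props = new Properties();
        // Flush state stores and push records to the changelog topic every 10 seconds (assumed value).
        props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 10_000);
        // Directory where the local RocksDB instances live (assumed path).
        props.put(StreamsConfig.STATE_DIR_CONFIG, "/tmp/kafka-streams");
        return props;
    }
}
```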
In case of a multi-partitioned stream, Kafka Streams creates one task per partition and assigns these tasks to the threads of the stream-processing application. Much like how topics are replicated across brokers, the state stores can also be replicated across instances of the application as standby replicas. This means that, while each task maintains its own state store, another instance can maintain a standby copy of that store, which can be used to recover the data in the event that a task crashes.
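A sketch of the configuration that governs this, with assumed values: num.stream.threads controls how many threads share the tasks, and num.standby.replicas asks Kafka Streams to keep warm copies of each task's state store on other instances.

```java
import org.apache.kafka.streams.StreamsConfig;
import java.util.Properties;

public class ScalingConfig {
    static Properties scalingProps() {
        Properties props = new Properties();
        // Each stream thread is assigned one or more tasks (one task per input partition).
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);           // assumed thread count
        // Keep one standby copy of each task's state store on another instance,
        // so a failed task can recover without replaying the whole changelog.
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);         // assumed replica count
        return props;
    }
}
```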
Using these state stores to query data can significantly reduce response times, since the data is fetched from the local cache instead of an external database server.
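For instance, a local store can be read through the Interactive Queries API; this sketch assumes the hypothetical "counts-store" from the aggregation example above and a running KafkaStreams instance:

```java
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class StoreQuery {
    // Looks up the current count for a page directly from the local state store.
    static Long lookupCount(KafkaStreams streams, String page) {
        ReadOnlyKeyValueStore<String, Long> store = streams.store(
                StoreQueryParameters.fromNameAndType("counts-store",           // assumed store name
                        QueryableStoreTypes.keyValueStore()));
        // The read is served from the local RocksDB instance, with no external database round trip.
        return store.get(page);
    }
}
```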