Data Management Internals - KStreams and KTables

Kafka Streams is a great way of building stream processing applications. Built on top of the Kafka Producer and Consumer libraries, the Kafka Streams library leverages all the capabilities natively offered by Kafka, like horizontal scalability, parallel data processing, fault tolerance and much more. In this article, I am going to cover how Kafka Streams works internally.

Since the Kafka Streams library is built on top of the Kafka Producer and Consumer libraries, the way it processes data isn't much different. A partition on the stream maps to a partition on the topic, and a record on the stream maps to a message on the topic. Kafka Streams essentially provides an abstraction over the native Producer and Consumer libraries that lets developers get greater functionality with fewer lines of code. For example:

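Here is a minimal sketch of such a topology; the topic names, serdes and broker address are illustrative assumptions:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class DoublingApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "doubling-app");      // also prefixes internal topics and state
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

        StreamsBuilder builder = new StreamsBuilder();

        // Read each record from the input topic, double its value, write it to the output topic.
        KStream<String, Long> input =
                builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.Long()));
        input.mapValues(value -> value * 2)
             .to("output-topic", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```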

The above snippet takes an incoming stream of records, doubles the value of each record and writes it to the output topic. To accomplish the same thing with the plain Kafka Producer and Consumer APIs, we would have to write at least five times as many lines of code. The operation performed above is stateless, i.e. at no point are we actually storing the state of the records. All we are really doing is taking a record, transforming it and sending it forward. But Kafka Streams also provides us with something called state-stores.

These state-stores, provided by Kafka Streams, can be used to store and query data. They become really significant when we are dealing with stateful operations. The Kafka Streams DSL automatically creates and manages these state-stores when a stateful operation like join() or aggregate() is called on a KStream, or when a KStream is windowed. The way these state-stores work is really interesting.
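As a rough illustration, the sketch below uses a count() aggregation; the DSL keeps the running counts in a RocksDB-backed state store (and a changelog topic) without any explicit store-management code. The topic and store names are again illustrative:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

public class CountingTopology {
    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();

        // groupByKey().count() is a stateful aggregation: the running count per key
        // is kept in a state store that the DSL creates and manages for us.
        // Naming the store explicitly makes it queryable later.
        KTable<String, Long> counts = builder
                .stream("events", Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey()
                .count(Materialized.as("event-counts-store"));

        return builder;
    }
}
```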

These state-stores leverage RocksDB as a local, non-fault-tolerant cache and a Kafka changelog topic as a fault-tolerant backup. Data written to a state store is first held in memory, from where it is flushed to the local file system (a local RocksDB instance), which acts as the local cache; then, each time the commit interval elapses, the data is pushed to the Kafka state-store changelog topic. The changelog topic is hence the source of truth in case the application crashes and restarts: the data from the changelog topic is replayed to repopulate the local RocksDB instance and hence the state-store.
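As a rough sketch, two of the relevant knobs can be set through StreamsConfig; the values below are illustrative, and the changelog topic itself is created and written by the library rather than by application code:

```java
import org.apache.kafka.streams.StreamsConfig;
import java.util.Properties;

public class StateStoreConfig {
    static Properties stateStoreProps() {
        Properties props = new Properties();
        // How often state is committed and flushed downstream; the default is 30 s
        // (100 ms when exactly-once processing is enabled).
        props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 10_000);
        // Directory holding the local RocksDB instances that back the state stores.
        props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams");
        return props;
    }
}
```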

In the case of a multi-partitioned stream, the stream-processing application runs threads that are assigned tasks, with one task per input partition. Much like topics are replicated across brokers, the state-stores can also be replicated: while each task maintains the state-store for its own partition, the application can keep standby replicas of other tasks' stores, which can again be used to recover the data in the event that a task crashes.
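For illustration, the number of processing threads per instance and the number of standby replicas per task can be configured like this (the values are arbitrary examples):

```java
import org.apache.kafka.streams.StreamsConfig;
import java.util.Properties;

public class ParallelismConfig {
    static Properties parallelismProps() {
        Properties props = new Properties();
        // Processing threads in this instance; tasks (one per input partition)
        // are distributed across these threads.
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
        // Standby replicas kept for each task's state store, used to speed up
        // recovery when a task or instance fails. The default is 0.
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
        return props;
    }
}
```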

Querying data directly from these state-stores (through Kafka Streams' interactive queries) can greatly reduce response times, since the data is fetched from the local store instead of an external database server.
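A minimal sketch of such a lookup, against the store named in the earlier aggregation example and assuming an already-running KafkaStreams instance, might look like this:

```java
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class StoreLookup {
    // Look up the current count for a key directly from the local state store,
    // without going to an external database.
    static Long countFor(KafkaStreams streams, String key) {
        ReadOnlyKeyValueStore<String, Long> store = streams.store(
                StoreQueryParameters.fromNameAndType(
                        "event-counts-store", QueryableStoreTypes.keyValueStore()));
        return store.get(key);
    }
}
```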

