Data Management Internals - KStreams and KTables
Kafka Streams is a great way of building stream processing applications. Built on top of the Kafka Producer and Consumer libraries, the Kafka Streams library leverages the capabilities natively offered by Kafka, like horizontal scalability, parallel data processing, fault tolerance and much more. In this article, I am going to cover how Kafka Streams works internally.
Since the Kafka Streams library is built on the Kafka Producer and Consumer libraries, the way it processes data isn't much different. A partition on the stream maps to a partition on the topic, and a record on the stream maps to a message on the topic. Kafka Streams essentially provides an abstraction over the native Kafka Producer and Consumer libraries that lets developers get more functionality with fewer lines of code. For example:
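A minimal sketch of such a pipeline; the topic names, key/value types, application id and broker address here are assumptions for illustration:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class DoublerApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "doubler-app");        // assumed application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker address

        StreamsBuilder builder = new StreamsBuilder();

        // Read each record from the input topic, double its value, and write it to the output topic.
        KStream<String, Long> input =
                builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.Long()));
        input.mapValues(value -> value * 2)
             .to("output-topic", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```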
The above snippet takes an incoming stream of records, doubles the value of each record and puts it onto the output topic. To accomplish the same thing with the Kafka Producer and Consumer APIs, we would have to write at least five times as many lines of code. The operation performed above is stateless, i.e. at no point are we actually storing the state of the records. All we are really doing is taking a record, transforming it and sending it forward. But Kafka Streams also provides us with something called state stores.
These state stores, provided by Kafka Streams, can be used to store and query data. They become really significant when we are dealing with stateful operations. The Kafka Streams DSL automatically creates and manages these state stores when a stateful operation like join() or aggregate() is called on a KStream, or when a KStream is windowed. The way these state stores work is really interesting.
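A sketch of such a stateful operation, assuming a stream of page-view counts keyed by page name; the topic and store names are placeholders:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;

public class PageViewCounts {
    // Builds a topology that counts page views per page. Because count() is a
    // stateful operation, the DSL creates and manages the "counts-store" state
    // store (and its backing changelog topic) for us.
    static Topology buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, Long> views =
                builder.stream("page-views", Consumed.with(Serdes.String(), Serdes.Long())); // assumed topic
        views.groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
             .count(Materialized.as("counts-store"));                                        // assumed store name
        return builder.build();
    }
}
```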
These state stores use RocksDB as a local, non-fault-tolerant cache and a Kafka changelog topic as a fault-tolerant backup. Data written to a state store is first held in memory, from where it is flushed to the local file system (a local RocksDB instance), which acts as the local cache; then, on every elapse of the commit interval, the data is pushed onto the Kafka state-store changelog topic. The changelog topic is hence the source of truth in case the application crashes and then restarts: its data is used to repopulate the local RocksDB instance and hence the state store.
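Both the commit interval and the directory holding the local RocksDB instances are configurable; a small sketch, with assumed values:

```java
import org.apache.kafka.streams.StreamsConfig;
import java.util.Properties;

public class StateStoreConfig {
    static Properties stateProps() {
        Properties props = new Properties();
        // Flush state stores and push records to the changelog topic every 10 seconds (assumed value).
        props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 10_000);
        // Directory where the local RocksDB instances live (assumed path).
        props.put(StreamsConfig.STATE_DIR_CONFIG, "/tmp/kafka-streams");
        return props;
    }
}
```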
In case of a multi-partitioned stream, Kafka Streams creates one task per partition and assigns these tasks to the threads of the stream-processing application. Much like how topics are replicated across brokers, the state stores can also be replicated across instances of the application as standby replicas. This means that, while each task maintains its own state store, another instance can maintain a standby copy of that store, which can be used to recover the data in the event that a task crashes.
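A sketch of the configuration that governs this, with assumed values: num.stream.threads controls how many threads share the tasks, and num.standby.replicas asks Kafka Streams to keep warm copies of each task's state store on other instances.

```java
import org.apache.kafka.streams.StreamsConfig;
import java.util.Properties;

public class ScalingConfig {
    static Properties scalingProps() {
        Properties props = new Properties();
        // Each stream thread is assigned one or more tasks (one task per input partition).
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);           // assumed thread count
        // Keep one standby copy of each task's state store on another instance,
        // so a failed task can recover without replaying the whole changelog.
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);         // assumed replica count
        return props;
    }
}
```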
Using these state stores to query data can significantly reduce response times, since the data is fetched from the local cache instead of an external database server.
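For instance, a local store can be read through the Interactive Queries API; this sketch assumes the hypothetical "counts-store" from the aggregation example above and a running KafkaStreams instance:

```java
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class StoreQuery {
    // Looks up the current count for a page directly from the local state store.
    static Long lookupCount(KafkaStreams streams, String page) {
        ReadOnlyKeyValueStore<String, Long> store = streams.store(
                StoreQueryParameters.fromNameAndType("counts-store",           // assumed store name
                        QueryableStoreTypes.keyValueStore()));
        // The read is served from the local RocksDB instance, with no external database round trip.
        return store.get(page);
    }
}
```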