Gentle Intro to Data Streaming Landscape
Learning outcomes from this article:
What is stream computing and why should YOU learn about it?
As engineers, we should learn the technologies that power businesses today and will continue to do so in the coming decades. In that spirit, here is some business context on why stream processing is such a hot technology:
In the last 10 years, many businesses have emerged in the following domains:
Travel Booking - Booking.com / MakeMyTrip / Google Flights
One thing these companies have in common is:
They all react to events generated in their systems in real time!
Events can be contextual based on the nature of the product. For example:
With the above introduction, we can see that real time processing of events is at the heart of the technical infrastructure of these companies. The formal term that experienced software engineers use for this is Stream Processing.
Let's use divide and conquer to learn what it means.
Stream - a continuous flow of data / events, in most cases unbounded.
You might have heard of batch processing (basically what typical MapReduce architectures enable you to do). There, fixed-size chunks of data are processed at a regular cadence. Example: processing a full day of data at the day boundary, by first storing it for 24 hours and then passing it through the compute system to obtain results.
Now think of stream processing as processing this data in ultra small micro-batches (or even one event at a time) and feeding it to the compute as soon as it arrives. Example: Stripe, while building its real time fraud detection ML systems, as talked about here (link-to-stripe-article), needs a very low-latency answer to whether a financial transaction is fraudulent or not.
Stream processing is the architectural paradigm that helps you achieve that !!
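To make the batch vs. stream contrast concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the `(user_id, amount)` event shape and both function names are hypothetical, and no real streaming engine is involved. The batch function waits for all the data to be stored before computing, while the streaming function updates its state and emits a fresh result for every event.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical event shape for illustration: (user_id, amount)
Event = Tuple[str, float]


def batch_totals(stored_events: List[Event]) -> Dict[str, float]:
    """Batch style: all events are already stored, compute the result once."""
    totals: Dict[str, float] = {}
    for user_id, amount in stored_events:
        totals[user_id] = totals.get(user_id, 0.0) + amount
    return totals


def streaming_totals(events: Iterable[Event]) -> Iterator[Tuple[str, float]]:
    """Stream style: update state and emit an up-to-date result per event."""
    totals: Dict[str, float] = {}
    for user_id, amount in events:
        totals[user_id] = totals.get(user_id, 0.0) + amount
        yield user_id, totals[user_id]


events = [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)]

# One answer, available only after every event has been collected.
batch_result = batch_totals(events)

# One answer per event, available as soon as that event arrives.
stream_results = list(streaming_totals(iter(events)))
```

Both approaches end with the same totals. The difference is when the answer becomes available: the streaming version already has an answer for alice after the very first event.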
If you're a newbie to stream processing, consider giving it some time. At first things seem very similar between batch processing and stream processing, but when you delve deeper into the architecture, state management, compute, nature of the serving layer, that's when you will start appreciating the intricacies. And we all know, the devil lies in the details :)
Typical Real Time Data Streaming Architecture in Industry
Now that we know a little about stream processing, let's double click on the typical architecture used in industry to achieve this.
Please note, building an organization's architecture is always a journey. A day-1 architecture would almost never demand many components. A simple Google Firebase setup with a Node client should be enough to serve your initial use case.
Only when you think deeply about the problem based on the factors I elaborate in the last section of this post, and you realize that your org needs to go through the paradigm shift of moving to real time streaming architecture, should you try to use the following components.
I'll briefly talk about the different components used here.
Once you have Kafka in the picture, you need a system to process your data in real time. By processing, I mean you should be able to write jobs that slice and dice the data stream for real time results.
Such a feat can be achieved by having Apache Flink or Kafka Streams in place. At a very crude level, think of them as real time, low latency Hadoop systems (it's not the BEST analogy, but fine to begin with :-) ). I'll explain them under the next heading of this post.
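As a rough picture of what "slicing and dicing" a stream can look like, here is a small sketch of a tumbling window count in plain Python. This is not the Flink or Kafka Streams API; the `(timestamp, event_type)` event shape and the function name are assumptions made up for illustration. It groups events into fixed, non-overlapping 60 second windows and counts each event type per window.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical event shape: (timestamp in seconds, event type)
Event = Tuple[int, str]


def tumbling_window_counts(
    events: Iterable[Event], window_seconds: int
) -> Dict[Tuple[int, str], int]:
    """Count events per (window start, event type) in fixed-size windows."""
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for timestamp, event_type in events:
        # Align the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, event_type)] += 1
    return dict(counts)


clicks = [(0, "search"), (12, "booking"), (59, "search"), (61, "search"), (130, "booking")]
per_minute = tumbling_window_counts(clicks, window_seconds=60)
```

A real stream processor would do more than this sketch: it would emit each window's result as the window closes, keep the state fault tolerant, and deal with late or out-of-order events.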
Now that you have processed your data and computed output tables from it using Apache Flink / Kafka Streams, you need to start thinking about a serving layer, which acts as the touchpoint for your users / clients to interact with the output of the stream jobs that you ran in Flink etc.
If your users / clients will issue transactional queries over the output, then a typical Postgres-like datastore can serve you well.
If you expect your users / clients to issue analytical queries over the output of your stream jobs, then things again break up into 2 cases:
External User Facing Analytics: This is the paradigm where you foresee thousands or hundreds of thousands of users issuing analytical queries on your system and expecting very low latency results. For such use cases, technologies like Apache Pinot / Rockset really do the job well, because their highly optimized indexing layer and micro-sharded storage reduce the number of rows they have to scan to serve you results.
Internal User Facing Analytics: If you foresee your own employees running analytical queries on top of the output from your streaming jobs, then Apache Druid is a battle-tested system which helps you achieve this. You can give it a read here.
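The difference between transactional and analytical queries on the serving layer can be sketched with a tiny example. The snippet below uses Python's built-in `sqlite3` purely as a stand-in; it is not Postgres, Pinot or Druid, and the `booking_totals` table with its columns is a hypothetical output of a streaming job.

```python
import sqlite3

# In-memory SQLite database standing in for a serving layer (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE booking_totals (user_id TEXT, city TEXT, total_spend REAL)"
)
conn.executemany(
    "INSERT INTO booking_totals VALUES (?, ?, ?)",
    [("u1", "Paris", 120.0), ("u2", "Paris", 80.0), ("u3", "Tokyo", 200.0)],
)

# Transactional-style query: fetch one row by its key.
point_lookup = conn.execute(
    "SELECT total_spend FROM booking_totals WHERE user_id = ?", ("u2",)
).fetchone()

# Analytical-style query: scan many rows and aggregate them.
spend_by_city = conn.execute(
    "SELECT city, SUM(total_spend) FROM booking_totals "
    "GROUP BY city ORDER BY city"
).fetchall()
```

The point lookup touches a single row, which is what row-oriented transactional stores are good at. The aggregation has to look at many rows, which is where real time OLAP stores earn their keep once the table has billions of rows.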
What if I told you that you could combine the power of:
in a single system, which can help you run stream processing jobs and at the same time, persist that data for easier debugging.
Such systems, which at a very high level enable you to run stream processing jobs while providing a database-like touchpoint for both your streaming jobs and your serving layer, are called streaming databases.
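One way to picture a system that runs a streaming job and also offers a database-like touchpoint is a view that is updated on every event and can be queried at any moment. The sketch below is a toy model in plain Python, not how any particular product is implemented; the class name `RunningCountView` and its methods are hypothetical.

```python
from typing import Dict


class RunningCountView:
    """Toy model: state is updated per event and is queryable at any time."""

    def __init__(self) -> None:
        # The stored state plays the role of the "database" side.
        self._counts: Dict[str, int] = {}

    def on_event(self, key: str) -> None:
        """Streaming side: fold one incoming event into the stored state."""
        self._counts[key] = self._counts.get(key, 0) + 1

    def query(self, key: str) -> int:
        """Serving side: read the current result without recomputing it."""
        return self._counts.get(key, 0)


view = RunningCountView()
for page in ["home", "search", "home"]:
    view.on_event(page)

home_views = view.query("home")
```

Because the result is kept up to date as events arrive, a query is just a read of existing state. That stored state is also what makes debugging easier, since you can inspect it at any point.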
What's the difference between Stream Processors / Real Time OLAP Stores / Streaming Databases?
Having mentioned so many systems here, you might be tempted to throw away your laptop and question your choice of career :)
In short, streaming databases treat storage as a first-class citizen for stream processing use cases. Needless to say, persistent state is very useful. But is it also costly? (Wait till you hear about the compute-storage separation design choice.)
An interesting open source project, also available as a fully managed service, that I recently came across is RisingWave from RisingWave Labs, a streaming database written in Rust !! Yes, you read that right :)
3. Real Time OLAP Datastores: These are highly optimized columnar data stores which pre-compute a range of indexes on top of your data to serve your users very low latency queries at scale. You can check out this article of mine to know more about this technology. <Insert Link Here.>
Some things that I liked about streaming database technology while analyzing the open source project RisingWave.
This was a gentle introduction to the world of stream processing and the technologies that power it. Stay tuned for my next article, where we dissect the questions to consider while choosing one of these technologies for your infrastructure.