Gentle Intro to Data Streaming Landscape

Learning outcomes from this article:

  • What is stream computing and why should YOU learn about it?
  • Typical streaming architectures in distributed computing
  • What are Streaming Databases / Stream Processors / Real-Time OLAP Stores?

What is stream computing and why should YOU learn about it?

As engineers, we should learn the technologies that power businesses today and will continue to do so in the coming decades. As a corollary to this thought, here is some business context on why stream processing technology is so hot:

In the last 10 years, multiple businesses have emerged in domains such as:

Travel Booking - Booking.com / MakeMyTrip / Google Flights



One common thread runs across the fabric of all these companies:

They all react to events generated in their systems in real time!

Events can be contextual based on the nature of the product. For example:

  • booking request
  • real time position of delivery person
  • real time fluctuation in stock price
  • real time detection of financial transactions etc.

With the above introduction, we can digest the fact that real-time processing of events is at the heart of the technical infrastructure of these companies. The formal term experienced software engineers have invented for this is Stream Processing.

Let's use divide and conquer to learn what it means.

Stream - a continuous flow of data/events, in most cases unbounded.

You might have heard of batch processing (basically what typical MapReduce architectures enable you to do), wherein fixed-size chunks of data are processed at a regular cadence. Example: processing all the data that arrives within a day boundary by first storing it for 24 hours and then passing it through the compute system to obtain results.


Now think of stream processing as processing this data in ultra-small micro-batches, feeding it to the compute layer as soon as it arrives. Example: Stripe, while building its real-time fraud detection ML systems, as talked about here (link-to-stripe-article), needs sub-millisecond answers to whether a financial transaction was fraudulent or not.

Stream processing is the architectural paradigm that helps you achieve exactly that!
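To make the batch-versus-stream contrast concrete, here is a toy sketch in plain Python (no framework; everything here is illustrative): the same word count computed batch-style, after all data is stored, and stream-style, where state is updated as each event arrives.

```python
from collections import Counter

events = ["buy", "view", "buy", "view", "view"]

# Batch: store everything for the period, then run one job over it.
def batch_count(stored_events):
    return Counter(stored_events)

# Stream: maintain running state, updated per event as it arrives.
def stream_count(event_source):
    state = Counter()
    for event in event_source:   # could be an unbounded source in practice
        state[event] += 1
        yield dict(state)        # a result is available after every event

assert batch_count(events) == Counter({"view": 3, "buy": 2})

# The streaming version reaches the same final answer,
# but exposes intermediate results along the way.
snapshots = list(stream_count(iter(events)))
assert snapshots[-1] == {"view": 3, "buy": 2}
```

The final answers match; the difference is that the streaming version never waits for a "day boundary" to produce one.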

If you're a newbie to stream processing, consider giving it some time. At first, things seem very similar between batch processing and stream processing, but when you delve deeper into the architecture (state management, compute, the nature of the serving layer), that's when you will start appreciating the intricacies. And we all know, the devil lies in the details :)

Typical Real-Time Data Streaming Architecture in Industry

Now that we know a little about stream processing, let's double-click on the typical architecture used in industry to achieve it.



Please note, building an organization's architecture is always a journey, hence a day-1 architecture would almost never demand too many components. A simple Google Firebase setup with a Node client should be enough to serve your initial use case.

KISS !!


Only when you have thought deeply about the problem, based on the factors I elaborate in the last section of this post, and realized that your org needs to go through the paradigm shift of moving to a real-time streaming architecture, should you reach for the following components.

This is a typical vanilla flavor of a data streaming architecture.

I'll briefly talk about the different components used here.

  • The data in your organization can reside in different types of databases, and Kafka is typically one of the most reliable components to act as a highway connecting your data to other data systems. In this case, we call the place your data initially resides the source system and the eventual destination the sink system.
  • Kafka Connect is a reliable, distributed framework that helps you build connectors to achieve this feat. You can read more about it here.
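As a concrete sketch, here is roughly what a Kafka Connect source connector configuration can look like, using a JDBC source that streams a Postgres table into a topic. The host, database, table, and connector names here are all made up for illustration; check the connector's own documentation for the exact options it supports.

```json
{
  "name": "orders-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "tasks.max": "1",
    "connection.url": "jdbc:postgresql://db-host:5432/shop",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "table.whitelist": "orders",
    "topic.prefix": "pg."
  }
}
```

Note there is no custom code here at all: the connector is declared as config and Kafka Connect runs it as a distributed, fault-tolerant job.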

Once Kafka is in the picture, you need a system to process your data in real time. By processing, I mean you should be able to write jobs that slice and dice the data stream for real-time results.

Such a feat can be achieved by having Apache Flink or Kafka Streams in place. At a very crude level, think of them as real-time, low-latency Hadoop systems (it's not the BEST analogy, but fine to begin with :-) ). I'll explain them in the next section of this post.
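To give a feel for the kind of "slice and dice" job you would write in Flink or Kafka Streams, here is a toy tumbling-window aggregation in plain Python. Real engines add parallelism, fault-tolerant state, and watermarks for late data; this sketch only shows the core idea of bucketing events by time window.

```python
from collections import defaultdict

def tumbling_window_sum(events, window_size_s):
    """events: iterable of (timestamp_seconds, amount) tuples."""
    windows = defaultdict(float)
    for ts, amount in events:
        window_start = ts - (ts % window_size_s)  # bucket by window start
        windows[window_start] += amount
    return dict(windows)

payments = [(0, 10.0), (5, 20.0), (61, 5.0), (119, 1.0)]

# Two 60-second windows: [0, 60) and [60, 120)
assert tumbling_window_sum(payments, 60) == {0: 30.0, 60: 6.0}
```

In a real streaming job the same logic runs continuously, emitting each window's result as soon as the engine decides that window is complete.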

Now that you have processed your data and computed output tables from it using Apache Flink / Kafka Streams, you need to start thinking of a serving layer which would act as a touch point for your users/clients to interact with the output of the stream jobs that you ran in Flink etc.

For such a scenario, if your users / clients - would issue transactional queries over the output, then a typical Postgres-like datastore can serve you well.

In case you expect your users/clients to issue analytical queries over the output of your stream jobs, this again breaks down into two cases:

External User Facing Analytics: This is the paradigm where you foresee thousands or hundreds of thousands of users issuing analytical queries on your system and expecting sub-millisecond-latency results. For such use cases, technologies like Apache Pinot or Rockset really do the job well, thanks to their highly optimized indexing layer and micro-sharded storage, which reduce the number of rows scanned to serve each query.
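Here is a minimal sketch of why indexing reduces the rows scanned: a tiny inverted index mapping a column value to row ids, so a filter query reads only matching rows instead of the whole table. (Pinot and Rockset build far richer index structures, but the principle is the same; all data below is made up.)

```python
from collections import defaultdict

rows = [
    {"id": 0, "country": "IN", "amount": 10},
    {"id": 1, "country": "US", "amount": 25},
    {"id": 2, "country": "IN", "amount": 7},
]

# Build the inverted index once, at ingest time.
index = defaultdict(list)
for row in rows:
    index[row["country"]].append(row["id"])

def total_amount(country):
    # Only rows whose ids appear in the index posting list are read.
    return sum(rows[i]["amount"] for i in index[country])

assert total_amount("IN") == 17   # scanned 2 rows, not 3
```

At 3 rows this gains nothing, but at billions of rows, skipping the non-matching ones is the difference between milliseconds and minutes.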

Internal User Facing Analytics: If you foresee your own employees running analytical queries on top of the output of your streaming jobs, then Apache Druid is a battle-tested system which helps you achieve this feat. You can give it a read here.



What if I told you that you could combine the power of:

  • Stream Processing Systems like Flink, KStreams
  • Touch Point / Serving Layer for users - Apache Pinot, Apache Druid

in a single system, which can help you run stream processing jobs and, at the same time, persist that data for easier debugging?

Such systems, at a very high level, enable you to run stream processing jobs while providing a database-like touchpoint for both your streaming jobs and your serving layer. They are called Streaming Databases.

What's the difference between Stream Processors, Real-Time OLAP Stores, and Streaming Databases?

Having mentioned so many systems here, you might be tempted to throw away your laptop and question your choice of career :)



  1. Stream Processing: A stream processing application is a DAG (Directed Acyclic Graph), where each node is a processing step. You build a DAG by writing individual processing functions that perform operations as data flows through them. These functions can be stateless operations like transformations or filtering, or stateful operations like aggregations that remember the result of the previous execution. Stateful stream processing is what lets you derive insights from streaming data.
  2. Streaming Databases: This technology can ingest data from streaming data sources and make it available for querying immediately. It extends stateful stream processing and brings additional features from the database world, such as columnar file formats, indexing, incrementally updated materialized views, and scatter-gather query execution.

In short, streaming databases treat storage as a first-class citizen for stream processing use cases. Needless to say, a persistent state is extremely useful. But is it also costly? (Wait till you hear about the compute-storage separation design choice.)

An interesting open source and fully managed offering that I recently came across is RisingWave Labs, which offers a streaming database written in Rust !! Yes, you read that right :)

  3. Real Time OLAP Datastores: These are highly optimized columnar data storage platforms which pre-compute a range of indexes on top of your data to serve your users sub-millisecond-latency queries at scale. You can check out this article of mine to know more about this technology. <Insert Link Here.>
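The DAG model described in point 1 above can be sketched in plain Python, with no framework: each node is a function, data flows through a stateless step (filtering) into a stateful step (aggregation) that remembers previous results. All event data and names here are illustrative.

```python
def source():
    # Stand-in for an unbounded event source such as a Kafka topic.
    for e in [{"user": "a", "amount": 120},
              {"user": "b", "amount": 40},
              {"user": "a", "amount": 300}]:
        yield e

def large_only(stream):
    # Stateless node: filtering, no memory between events.
    return (e for e in stream if e["amount"] >= 100)

def running_total_per_user(stream):
    # Stateful node: remembers the result of previous executions.
    state = {}
    for e in stream:
        state[e["user"]] = state.get(e["user"], 0) + e["amount"]
        yield dict(state)

# Wire the DAG: source -> filter -> stateful aggregate
results = list(running_total_per_user(large_only(source())))
assert results[-1] == {"a": 420}   # user "b" was filtered out upstream
```

A real stream processor runs the same shape of DAG, but partitions it across machines and checkpoints the state so it survives failures.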


Some things that I liked about streaming database technology while analyzing the open source project RisingWave.


  • It combines the power of the stream processing layer and the serving layer in one system, reducing the complexity of operating two systems. If you've worked with distributed systems in production, you'll really appreciate this benefit.
  • Having storage as the core offering enables easier debugging of streaming jobs, which is slightly nuanced when it comes to Apache Flink, Kafka Streams, etc.
  • Postgres-like SQL semantics for querying data - The onboarding time for an engineer to interact with a streaming DB is practically zero, because every engineer knows SQL, and that's exactly how you orchestrate streaming jobs in this tech and also access the results of those computations.
  • Ad-hoc nature of queries - When you write an ETL pipeline as a Flink / KStreams job, you typically deploy it to PROD and then let it run on the continuous flow of data. However, if you're on call and need to debug issues in your job, you need to run ad-hoc jobs for validation, and a streaming DB is a great choice for that. You can also set up BI systems and dashboarding on top of this layer. The possibilities are just infinite :P
  • Decoupled Storage and Compute - The biggest benefit of the cloud-native era is paying for what you use on a consumption-based model. After going through the design documents in the RisingWave open source repository, it's apparent that this separation between storage and compute provides practically infinite scalability with economies of scale: you shell out dollars only for what's necessary. There are shortcomings as well: for sudden spikes in your system, if compute is required but isn't readily available, the system has to auto-scale, and that latency can cause issues in your downstream services. Though that's a nuanced detail which deserves an entire article of its own :)
  • Performance - We've been talking about latency all the way through this article, so it makes sense to benchmark performance against the standard systems out there. The website claims that RisingWave is 10-50% faster than Apache Flink !! Though they are yet to publish a benchmarking report, and I'm eagerly awaiting the same :)
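To make the "SQL as the streaming job" point above concrete, here is roughly what a streaming job might look like in a Postgres-flavored streaming database. The table and column names are made up for illustration, and the exact syntax varies by product, so treat this as a sketch rather than copy-paste RisingWave code:

```sql
-- Define the streaming job as an incrementally maintained materialized view:
-- as new rows arrive on the `transactions` source, the view updates itself.
CREATE MATERIALIZED VIEW fraud_counts AS
SELECT merchant_id,
       COUNT(*) AS suspicious_txns
FROM transactions
WHERE risk_score > 0.9
GROUP BY merchant_id;

-- Reading the results is an ordinary ad-hoc query, no separate job to deploy:
SELECT * FROM fraud_counts ORDER BY suspicious_txns DESC LIMIT 10;
```

The same two statements cover what would otherwise be a deployed Flink job plus a separate serving store, which is exactly the consolidation argument made above.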


This was a gentle introduction to the world of stream processing and the technologies that power it. Stay tuned for my next article, where we dissect the questions to consider while choosing one of these technologies for your infrastructure.
