Gentle Intro to Data Streaming Landscape

Learning outcomes from this article:

  • What is stream computing and why should YOU learn about it?
  • Typical streaming architectures in distributed computing
  • What are Streaming Databases / Stream Processors / Real-Time OLAP Stores?

What is stream computing and why should YOU learn about it?

As engineers, we should learn the technologies that power businesses today and will continue to do so in the coming decades. As a corollary to this thought, here is some business context on why stream processing technology is so hot:

In the last 10 years, multiple businesses have emerged in domains such as:

Travel Booking - Booking.com / MakeMyTrip / Google Flights



One common thread runs across the fabric of all these companies:

They all react to events generated in their systems in real time!

Events can be contextual based on the nature of the product. For example:

  • booking request
  • real time position of delivery person
  • real time fluctuation in stock price
  • real time detection of financial transactions etc.

With the above introduction, we can digest the fact that real-time processing of events is at the heart of the technical infrastructure of these companies. The formal term experienced software engineers have invented for this is Stream Processing.

Let's use divide and conquer to learn what it means.

Stream - a continuous flow of data/events, in most cases unbounded.

You might have heard of batch processing (basically what typical MapReduce architectures enable you to do), wherein fixed-size chunks of data are processed at a regular cadence. Example: processing all the data that arrives within a day boundary by first storing it for 24 hours and then passing it through the compute system to obtain results.


Now think of stream processing as processing this data in ultra-small micro-batches, feeding it to the compute layer as soon as it arrives. Example: Stripe, while building its real-time fraud detection ML systems, as talked about here (link-to-stripe-article), needs sub-millisecond answers to whether a financial transaction was fraudulent or not.

Stream processing is the architectural paradigm that helps you achieve exactly that!
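To make the batch-versus-stream contrast concrete, here is a toy sketch in plain Python (no framework; everything here is illustrative): the same word count computed batch-style, after all data is stored, and stream-style, where state is updated as each event arrives.

```python
from collections import Counter

events = ["buy", "view", "buy", "view", "view"]

# Batch: store everything for the period, then run one job over it.
def batch_count(stored_events):
    return Counter(stored_events)

# Stream: maintain running state, updated per event as it arrives.
def stream_count(event_source):
    state = Counter()
    for event in event_source:   # could be an unbounded source in practice
        state[event] += 1
        yield dict(state)        # a result is available after every event

assert batch_count(events) == Counter({"view": 3, "buy": 2})

# The streaming version reaches the same final answer,
# but exposes intermediate results along the way.
snapshots = list(stream_count(iter(events)))
assert snapshots[-1] == {"view": 3, "buy": 2}
```

The final answers match; the difference is that the streaming version never waits for a "day boundary" to produce one.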

If you're a newbie to stream processing, consider giving it some time. At first, things seem very similar between batch processing and stream processing, but when you delve deeper into the architecture (state management, compute, the nature of the serving layer), that's when you will start appreciating the intricacies. And we all know, the devil lies in the details :)

Typical Real-Time Data Streaming Architecture in Industry

Now that we know a little about stream processing, let's double-click on the typical architecture used in industry to achieve it.



Please note, building an organization's architecture is always a journey, hence a day-1 architecture would almost never demand too many components. A simple Google Firebase setup with a Node client should be enough to serve your initial use case.

KISS !!


Only when you have thought deeply about the problem, based on the factors I elaborate in the last section of this post, and realized that your org needs to go through the paradigm shift of moving to a real-time streaming architecture, should you reach for the following components.

This is a typical vanilla flavor of a data streaming architecture.

I'll briefly talk about the different components used here.

  • The data in your organization can reside in different types of databases, and Kafka is typically one of the most reliable components to act as a highway connecting your data to other data systems. In this case, we call the place your data initially resides the source system and the eventual destination the sink system.
  • Kafka Connect is a reliable, distributed framework that helps you build connectors to achieve this feat. You can read more about it here.
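As a concrete sketch, here is roughly what a Kafka Connect source connector configuration can look like, using a JDBC source that streams a Postgres table into a topic. The host, database, table, and connector names here are all made up for illustration; check the connector's own documentation for the exact options it supports.

```json
{
  "name": "orders-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "tasks.max": "1",
    "connection.url": "jdbc:postgresql://db-host:5432/shop",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "table.whitelist": "orders",
    "topic.prefix": "pg."
  }
}
```

Note there is no custom code here at all: the connector is declared as config and Kafka Connect runs it as a distributed, fault-tolerant job.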

Once Kafka is in the picture, you need a system to process your data in real time. By processing, I mean you should be able to write jobs that slice and dice the data stream for real-time results.

Such a feat can be achieved by having Apache Flink or Kafka Streams in place. At a very crude level, think of them as real-time, low-latency Hadoop systems (it's not the BEST analogy, but fine to begin with :-) ). I'll explain them in the next section of this post.
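To give a feel for the kind of "slice and dice" job you would write in Flink or Kafka Streams, here is a toy tumbling-window aggregation in plain Python. Real engines add parallelism, fault-tolerant state, and watermarks for late data; this sketch only shows the core idea of bucketing events by time window.

```python
from collections import defaultdict

def tumbling_window_sum(events, window_size_s):
    """events: iterable of (timestamp_seconds, amount) tuples."""
    windows = defaultdict(float)
    for ts, amount in events:
        window_start = ts - (ts % window_size_s)  # bucket by window start
        windows[window_start] += amount
    return dict(windows)

payments = [(0, 10.0), (5, 20.0), (61, 5.0), (119, 1.0)]

# Two 60-second windows: [0, 60) and [60, 120)
assert tumbling_window_sum(payments, 60) == {0: 30.0, 60: 6.0}
```

In a real streaming job the same logic runs continuously, emitting each window's result as soon as the engine decides that window is complete.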

Now that you have processed your data and computed output tables from it using Apache Flink / Kafka Streams, you need to start thinking of a serving layer which would act as a touch point for your users/clients to interact with the output of the stream jobs that you ran in Flink etc.

For such a scenario, if your users / clients - would issue transactional queries over the output, then a typical Postgres-like datastore can serve you well.

In case you expect your users/clients to issue analytical queries over the output of your stream jobs, this again breaks down into two cases:

External User Facing Analytics: This is the paradigm where you foresee thousands or hundreds of thousands of users issuing analytical queries on your system and expecting sub-millisecond-latency results. For such use cases, technologies like Apache Pinot or Rockset really do the job well, thanks to their highly optimized indexing layer and micro-sharded storage, which reduce the number of rows scanned to serve each query.
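Here is a minimal sketch of why indexing reduces the rows scanned: a tiny inverted index mapping a column value to row ids, so a filter query reads only matching rows instead of the whole table. (Pinot and Rockset build far richer index structures, but the principle is the same; all data below is made up.)

```python
from collections import defaultdict

rows = [
    {"id": 0, "country": "IN", "amount": 10},
    {"id": 1, "country": "US", "amount": 25},
    {"id": 2, "country": "IN", "amount": 7},
]

# Build the inverted index once, at ingest time.
index = defaultdict(list)
for row in rows:
    index[row["country"]].append(row["id"])

def total_amount(country):
    # Only rows whose ids appear in the index posting list are read.
    return sum(rows[i]["amount"] for i in index[country])

assert total_amount("IN") == 17   # scanned 2 rows, not 3
```

At 3 rows this gains nothing, but at billions of rows, skipping the non-matching ones is the difference between milliseconds and minutes.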

Internal User Facing Analytics: If you foresee your own employees running analytical queries on top of the output of your streaming jobs, then Apache Druid is a battle-tested system which helps you achieve this feat. You can give it a read here.



What if I told you that you could combine the power of:

  • Stream Processing Systems like Flink, KStreams
  • Touch Point / Serving Layer for users - Apache Pinot, Apache Druid

in a single system, which can help you run stream processing jobs and, at the same time, persist that data for easier debugging?

Such systems, at a very high level, enable you to run stream processing jobs while providing a database-like touchpoint for both your streaming jobs and your serving layer. They are called Streaming Databases.

What's the difference between Stream Processors, Real-Time OLAP Stores, and Streaming Databases?

Having mentioned so many systems here, you might be tempted to throw away your laptop and question your choice of career :)



  1. Stream Processing: A stream processing application is a DAG (Directed Acyclic Graph), where each node is a processing step. You build a DAG by writing individual processing functions that perform operations as data flows through them. These functions can be stateless operations like transformations or filtering, or stateful operations like aggregations that remember the result of the previous execution. Stateful stream processing is what lets you derive insights from streaming data.
  2. Streaming Databases: This technology can ingest data from streaming data sources and make it available for querying immediately. It extends stateful stream processing and brings additional features from the database world, such as columnar file formats, indexing, incrementally updated materialized views, and scatter-gather query execution.

In short, streaming databases treat storage as a first-class citizen for stream processing use cases. Needless to say, a persistent state is extremely useful. But is it also costly? (Wait till you hear about the compute-storage separation design choice.)

An interesting open source and fully managed offering that I recently came across is RisingWave Labs, which offers a streaming database written in Rust !! Yes, you read that right :)

  3. Real Time OLAP Datastores: These are highly optimized columnar data storage platforms which pre-compute a range of indexes on top of your data to serve your users sub-millisecond-latency queries at scale. You can check out this article of mine to know more about this technology. <Insert Link Here.>
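The DAG model described in point 1 above can be sketched in plain Python, with no framework: each node is a function, data flows through a stateless step (filtering) into a stateful step (aggregation) that remembers previous results. All event data and names here are illustrative.

```python
def source():
    # Stand-in for an unbounded event source such as a Kafka topic.
    for e in [{"user": "a", "amount": 120},
              {"user": "b", "amount": 40},
              {"user": "a", "amount": 300}]:
        yield e

def large_only(stream):
    # Stateless node: filtering, no memory between events.
    return (e for e in stream if e["amount"] >= 100)

def running_total_per_user(stream):
    # Stateful node: remembers the result of previous executions.
    state = {}
    for e in stream:
        state[e["user"]] = state.get(e["user"], 0) + e["amount"]
        yield dict(state)

# Wire the DAG: source -> filter -> stateful aggregate
results = list(running_total_per_user(large_only(source())))
assert results[-1] == {"a": 420}   # user "b" was filtered out upstream
```

A real stream processor runs the same shape of DAG, but partitions it across machines and checkpoints the state so it survives failures.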


Some things that I liked about streaming database technology while analyzing the open source project RisingWave.


  • It combines the power of the stream processing layer and the serving layer in one system, reducing the complexity of operating two systems. If you've worked with distributed systems in production, you'll really appreciate this benefit.
  • Having storage as the core offering enables easier debugging of streaming jobs, which is slightly nuanced when it comes to Apache Flink, Kafka Streams, etc.
  • Postgres-like SQL semantics for querying data - The onboarding time for an engineer to interact with a streaming DB is practically zero, because every engineer knows SQL, and that's exactly how you orchestrate streaming jobs in this tech and also access the results of those computations.
  • Ad-hoc nature of queries - When you write an ETL pipeline as a Flink / KStreams job, you typically deploy it to PROD and then let it run on the continuous flow of data. However, if you're on call and need to debug issues in your job, you need to run ad-hoc jobs for validation, and a streaming DB is a great choice for that. You can also set up BI systems and dashboarding on top of this layer. The possibilities are just infinite :P
  • Decoupled Storage and Compute - The biggest benefit of the cloud-native era is paying for what you use on a consumption-based model. After going through the design documents in the RisingWave open source repository, it's apparent that this separation between storage and compute provides practically infinite scalability with economies of scale: you shell out dollars only for what's necessary. There are shortcomings as well: for sudden spikes in your system, if compute is required but isn't readily available, the system has to auto-scale, and that latency can cause issues in your downstream services. Though that's a nuanced detail which deserves an entire article of its own :)
  • Performance - We've been talking about latency all the way through this article, so it makes sense to benchmark performance against the standard systems out there. The website claims that RisingWave is 10-50% faster than Apache Flink !! Though they are yet to publish a benchmarking report, and I'm eagerly awaiting the same :)
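To make the "SQL as the streaming job" point above concrete, here is roughly what a streaming job might look like in a Postgres-flavored streaming database. The table and column names are made up for illustration, and the exact syntax varies by product, so treat this as a sketch rather than copy-paste RisingWave code:

```sql
-- Define the streaming job as an incrementally maintained materialized view:
-- as new rows arrive on the `transactions` source, the view updates itself.
CREATE MATERIALIZED VIEW fraud_counts AS
SELECT merchant_id,
       COUNT(*) AS suspicious_txns
FROM transactions
WHERE risk_score > 0.9
GROUP BY merchant_id;

-- Reading the results is an ordinary ad-hoc query, no separate job to deploy:
SELECT * FROM fraud_counts ORDER BY suspicious_txns DESC LIMIT 10;
```

The same two statements cover what would otherwise be a deployed Flink job plus a separate serving store, which is exactly the consolidation argument made above.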


This was a gentle introduction to the world of stream processing and the technologies that power it. Stay tuned for my next article, where we dissect the questions to consider while choosing one of these technologies for your infrastructure.
