Data Streaming and Message Brokers

Idea of Data Streaming

In cases where the input is finite or bounded, we used Batch Processing to produce the output data. The output is derived data, since it can be recreated at any time by running the batch process again on the same finite set of inputs.

In the real world, this is often not the case. The input may be unbounded: the data arrives gradually over time, and we can never tell whether the input is complete. One example is reading the temperature of a particular location at a regular interval. As long as the system is up and running, temperature readings keep arriving.

There are multiple ways in which we can handle this situation.

Case 1:

Use batch processing to consume the input at a regular interval. For example, run the batch process at the end of every day to consume that day’s worth of temperature readings. In that case there is a delay in processing the data: it might take up to 24 hours for a user to see a day’s temperature readings.

We can improve this by running our batch process more frequently, say consuming an hour’s worth of data at a time.

Case 2:

Improving further on the previous case, we can process the temperature readings even more frequently, or remove the fixed interval entirely and process every reading the moment it originates. This is the idea behind Stream Processing.
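As a rough illustration, here is a minimal Python sketch of the difference. The temperature_readings generator and the window size are hypothetical stand-ins, not part of any real system discussed here:

```python
import time
from datetime import datetime, timezone

def temperature_readings():
    """Hypothetical unbounded source: yields one temperature reading per second, forever."""
    while True:
        yield {"ts": datetime.now(timezone.utc).isoformat(), "temp_c": 21.5}
        time.sleep(1)

def process_batch(readings, window_size=3):
    """Batch style: collect a fixed window of readings, then process them all at once."""
    window = []
    for reading in readings:
        window.append(reading)
        if len(window) == window_size:
            print("processing a batch of", len(window), "readings")
            window.clear()

def process_stream(readings):
    """Stream style: handle every reading the moment it arrives."""
    for reading in readings:
        print("processing reading immediately:", reading)

# process_stream(temperature_readings())   # runs until interrupted
```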


Polling

Let’s understand one possible method of streaming data from one system to another. For this, we will reuse the previous example of temperature readings being generated periodically for a specific location.

In stream processing the data is streamed as a sequence of Events. An Event is an immutable object containing a timestamp indicating when it happened, along with the event data itself.
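For instance, an event could be modelled as a small immutable record. This is only a sketch of the idea; the field names and the sample payload are assumptions, not a prescribed format:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)          # frozen=True makes the event immutable
class Event:
    timestamp: datetime          # when the event happened
    data: dict                   # the event payload, e.g. a temperature reading

event = Event(timestamp=datetime.now(timezone.utc),
              data={"location": "sensor-42", "temp_c": 31.2})
```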

Now, these events have to travel from the Producers to the Consumers. One simple way is for the producer to write each event to a datastore, while every consumer periodically checks the datastore to see whether a new event has appeared. This repeated checking by the consumers is called Polling.
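A naive polling consumer might look like the sketch below. The in-memory list and the fetch_new_events helper are hypothetical stand-ins for a real datastore and its query API:

```python
import time

datastore = []      # stands in for the datastore the producer writes events to
last_seen = 0       # position of the last event this consumer has processed

def fetch_new_events(since):
    """Hypothetical helper: return every event appended after position since."""
    return datastore[since:]

def polling_consumer(poll_interval_seconds=5):
    global last_seen
    while True:
        new_events = fetch_new_events(last_seen)   # one call to the datastore per poll
        for event in new_events:
            print("processing", event)
        last_seen += len(new_events)
        time.sleep(poll_interval_seconds)          # most polls return nothing at all
```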


Polling is an expensive process. The more frequently consumers poll the datastore, the smaller the fraction of calls that actually return a new event. Most of the calls are therefore wasted, adding unnecessary overhead.

A smarter approach would be for the datastore to notify the consumers whenever a new event becomes available. This is the approach followed by many traditional messaging systems.


Message Brokers

In the previous section we briefly touched upon push-based systems for streaming data. Message Brokers send a stream of events from producers to consumers through exactly such a push-based mechanism.

A message broker runs as a server, with producers and consumers connecting to it as clients. Producers write events to the broker, and consumers receive them from the broker.
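To make the push-based idea concrete, here is a toy in-memory sketch. It is not how a real broker such as RabbitMQ or Kafka is implemented (those run as separate server processes), but it shows consumers registering themselves once and the broker pushing every new event to them instead of being polled:

```python
class ToyBroker:
    """Minimal in-memory sketch of a push-based broker. Not production code."""

    def __init__(self):
        self.subscribers = []           # callbacks registered by consumers

    def subscribe(self, callback):
        """A consumer registers a callback to be notified of new events."""
        self.subscribers.append(callback)

    def publish(self, event):
        """A producer hands an event to the broker, which pushes it to consumers."""
        for callback in self.subscribers:
            callback(event)             # push: the broker calls the consumer

broker = ToyBroker()
broker.subscribe(lambda event: print("consumer received:", event))
broker.publish({"location": "sensor-42", "temp_c": 31.2})
```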


Let’s take a look at some characteristics of message brokers:

  1. Message brokers are not suitable for long-term data storage. They delete an event as soon as it has been successfully delivered to the consumers.
  2. Message brokers generally notify the consumers when new data becomes available.
  3. Since message brokers do not store events long-term, their queues stay short.
  4. Message brokers may also write their data to disk so that events are not lost in case of a crash.

When multiple consumers read events from the message broker, it can happen in two ways:

1. Load Balancing

In this mode the message broker parallelizes the processing by assigning events to the consumers arbitrarily. Each event is delivered to exactly one consumer.

It is useful when each event is expensive to process, since adding more consumers lets the events be processed in parallel.
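A toy sketch of this mode (round-robin assignment here, though a real broker may hand events to consumers in any order):

```python
import itertools

class LoadBalancingBroker:
    """Toy sketch: each event goes to exactly one consumer, in round-robin order."""

    def __init__(self, consumers):
        self.consumers = itertools.cycle(consumers)   # rotate through the consumers

    def publish(self, event):
        next(self.consumers)(event)                   # deliver to a single consumer

broker = LoadBalancingBroker([
    lambda e: print("consumer 1 got", e),
    lambda e: print("consumer 2 got", e),
])
for i in range(4):
    broker.publish({"reading": i})    # events alternate between the two consumers
```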


2. Fan Out

In this mode every event is delivered to all of the consumers. Each consumer connects to the message broker and processes the full stream independently of the others.
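This is essentially what the ToyBroker sketch above already does; spelled out again as its own toy class for contrast with the load-balancing version:

```python
class FanOutBroker:
    """Toy sketch: every event is delivered to every consumer."""

    def __init__(self, consumers):
        self.consumers = list(consumers)

    def publish(self, event):
        for consumer in self.consumers:
            consumer(event)           # every consumer sees every event

broker = FanOutBroker([
    lambda e: print("consumer 1 got", e),
    lambda e: print("consumer 2 got", e),
])
broker.publish({"reading": 0})        # printed by both consumers
```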


The load balancing and fan-out modes can be combined to stream events to multiple groups of consumers. Let’s say there are 2 groups of consumers, each having 3 consumers within. We can use fan out at the group level so that both consumer groups receive every event, and load balancing within each group to parallelize the processing of those events.
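A toy sketch of the combination (real brokers such as Kafka expose a similar idea as consumer groups; the classes below are illustrative only):

```python
import itertools

class ConsumerGroup:
    """Toy sketch: load-balances events across the consumers inside one group."""

    def __init__(self, consumers):
        self.consumers = itertools.cycle(consumers)

    def receive(self, event):
        next(self.consumers)(event)        # one consumer per event, round-robin

class GroupAwareBroker:
    """Toy sketch: fans every event out to all consumer groups."""

    def __init__(self, groups):
        self.groups = list(groups)

    def publish(self, event):
        for group in self.groups:          # fan out: every group gets the event
            group.receive(event)           # inside the group it is load-balanced

group_a = ConsumerGroup([lambda e: print("A1", e), lambda e: print("A2", e), lambda e: print("A3", e)])
group_b = ConsumerGroup([lambda e: print("B1", e), lambda e: print("B2", e), lambda e: print("B3", e)])
broker = GroupAwareBroker([group_a, group_b])
for i in range(3):
    broker.publish({"reading": i})         # each event reaches both groups, one consumer in each
```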


Fault Tolerance

Consumers may crash at any point of time. The message broker may have handed an event to a consumer for processing, and the consumer may have processed it only partially before crashing.

To ensure the message is not lost, message brokers use Acknowledgements. A consumer must explicitly tell the broker once it has finished processing an event, so that the broker can permanently remove the event from the queue.

If the consumer crashes after partially processing an event, without sending an acknowledgement, the broker assumes the event was not processed at all and re-delivers it to a different consumer.
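One way to sketch this behaviour (a toy model of the idea, not any real broker’s delivery protocol): the broker keeps unacknowledged events in an in-flight set and re-delivers an event whose consumer failed before acknowledging it:

```python
class AckingBroker:
    """Toy sketch: events stay 'in flight' until a consumer acknowledges them."""

    def __init__(self, consumers):
        self.consumers = list(consumers)
        self.in_flight = {}                        # event id -> event awaiting an ack

    def deliver(self, event_id, event, consumer_index):
        self.in_flight[event_id] = event
        try:
            self.consumers[consumer_index](event)
            self.acknowledge(event_id)             # consumer finished, so ack the event
        except Exception:
            # The consumer "crashed" mid-processing, so no ack arrived.
            # Re-deliver the event to a different consumer.
            other = (consumer_index + 1) % len(self.consumers)
            print("no ack for event", event_id, "- redelivering to consumer", other)
            self.consumers[other](event)
            self.acknowledge(event_id)

    def acknowledge(self, event_id):
        self.in_flight.pop(event_id, None)         # now safe to remove from the queue

def flaky_consumer(event):
    raise RuntimeError("crash while processing")   # simulates a crash mid-event

broker = AckingBroker([flaky_consumer, lambda e: print("consumer 2 processed", e)])
broker.deliver(event_id=1, event={"reading": 42}, consumer_index=0)
```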


The crash of Consumer 1 in the middle of processing also affects the order in which events are consumed. For example, if Consumer 1 crashes while consuming event E3, the broker re-delivers E3 to another consumer, which may end up processing E3 after a later event E4.

To avoid this reordering of events, we can use a separate queue per consumer.
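A minimal sketch of that idea, assuming the producer (or a routing rule) decides up front which consumer’s queue an event belongs to, so the event is never shuffled to a different consumer on failure:

```python
from queue import Queue

# One dedicated queue per consumer: an event is never re-routed to a different
# consumer, so the relative order of events in each queue is preserved.
queues = {"consumer_1": Queue(), "consumer_2": Queue()}

def route(event, consumer_name):
    """Hypothetical routing rule chosen by the producer."""
    queues[consumer_name].put(event)

route({"id": "E3"}, "consumer_1")
route({"id": "E4"}, "consumer_1")
print(queues["consumer_1"].get())   # E3 comes out before E4, in order
print(queues["consumer_1"].get())
```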


Conclusion

We discussed the idea of Data Streaming and multiple ways to achieve it. We also looked at the concept of Message Brokers and how they stream data from producers to consumers through a push-based mechanism.

Meanwhile, do like and share this edition with your peers, and subscribe to this newsletter so that you get notified when I come up with more content in the future. Share this newsletter with anyone who might benefit from it.

Until next time, Dive Deep and Keep Learning!
