Data Streaming and Message Brokers
Saurav Prateek
Engineer @ Google | Ex-SWE @ GeeksForGeeks | Authoring engineering newsletter with 30K+ Subs | 60K+ Linkedin | Content Creator | Mentor
Idea of Data Streaming
In cases where the input is finite or bounded, we use Batch Processing to produce the output data. The output is derived data, since it can be recreated by running the batch process again on the same finite input.
In the real world, this may not be the case: the input can be unbounded. The data arrives gradually over time, and one can never tell whether the input is complete. One example is reading the temperature of a particular location at a regular interval. As long as the system is up and running, temperature readings keep arriving.
There are multiple ways in which we can handle this situation.
Case 1:
Use Batch Processing to process the input at a regular interval. For example, run a batch process at the end of every day to consume that day's worth of temperature readings. In that case there will be a delay in processing the data: a user might wait up to 24 hours to get a day's temperature readings.
We can improve this by running our Batch process more frequently. Maybe consuming an hour's worth of data at a time?
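To make the batching idea concrete, here is a minimal sketch (the function name and sample readings are illustrative, not from the original) of grouping timestamped temperature readings into fixed-size time windows. A reading can only be processed once its window closes, which is exactly the delay the batch approach introduces:

```python
from collections import defaultdict

def batch_by_window(readings, window_seconds):
    """Group (timestamp, value) readings into fixed-size time windows.

    A reading taken at time t lands in the window starting at
    t - (t % window_seconds). A window can only be processed once it
    is complete, so results are delayed by up to window_seconds.
    """
    windows = defaultdict(list)
    for timestamp, value in readings:
        window_start = timestamp - (timestamp % window_seconds)
        windows[window_start].append(value)
    return dict(windows)

# Readings at t = 0 s, 1800 s and 3600 s, batched into hourly windows:
readings = [(0, 21.5), (1800, 22.0), (3600, 22.4)]
print(batch_by_window(readings, 3600))
# {0: [21.5, 22.0], 3600: [22.4]}
```

Shrinking `window_seconds` reduces the delay; taken to the limit (one event per "window"), it becomes stream processing.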
Case 2:
Improving on the previous case further, we can process the temperature readings more frequently, or remove the time interval entirely and process each reading the instant it originates. This is the idea behind Stream Processing.
Polling
Let's look at one possible method of streaming data from one system to another, using the previous example of periodically generated temperature readings for a specific location.
In stream processing the data is streamed as a sequence of Events. An Event is an immutable object containing a timestamp indicating when it happened, along with the event data.
Now, these events are sent from the Producers to the Consumers. In this process the Producer writes an event to a datastore, and each consumer periodically checks the datastore for new events. This pull-based way of transferring events from producer to consumer is called Polling.
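A minimal sketch of a polling consumer might look like the following (the datastore is modelled as a plain list of event records, and the sleep between polls is only described in a comment so the sketch stays runnable):

```python
def poll_for_events(datastore, last_seen_id, interval_seconds, max_polls):
    """Poll the datastore at a fixed interval, collecting new events.

    Each poll is a full round trip to the datastore; when no new event
    has arrived, the call is wasted. A real consumer would sleep for
    interval_seconds between polls (omitted here to keep it runnable).
    """
    collected = []
    for _ in range(max_polls):
        new_events = [e for e in datastore if e["id"] > last_seen_id]
        if new_events:
            collected.extend(new_events)
            last_seen_id = new_events[-1]["id"]
    return collected

# A datastore holding one event; of three polls, only the first finds it.
store = [{"id": 1, "temp": 21.5}]
print(poll_for_events(store, last_seen_id=0, interval_seconds=5, max_polls=3))
# [{'id': 1, 'temp': 21.5}]
```

The two wasted polls in this run illustrate the overhead described next.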
Polling is an expensive process. The more often consumers poll the datastore, the smaller the fraction of calls that actually return a new event. The wasted calls add up to significant overhead.
A smarter approach would be for the datastore to notify the consumers when an event becomes available. This push-based approach is followed by many traditional Messaging Systems.
Message Brokers
In the previous section we briefly touched upon push-based systems for streaming data. Message Brokers are used to send a stream of events from producers to consumers through a push-based mechanism.
A message broker runs as a server, with producers and consumers connecting to it as clients. Producers write events to the broker, and consumers receive them from the broker.
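The producer/broker/consumer relationship can be sketched in a few lines using Python's standard `queue.Queue` as a stand-in for the broker's queue (the `None` sentinel marking the end of the stream is an illustrative convention, not part of any real broker protocol):

```python
import queue
import threading

broker = queue.Queue()  # stand-in for the broker's event queue

def producer(n):
    """Write n temperature events to the broker, then a stop sentinel."""
    for i in range(n):
        broker.put({"event_id": i, "temp": 20.0 + i})
    broker.put(None)  # sentinel: no more events

def consumer(received):
    """Block until the broker delivers an event: push-style consumption."""
    while True:
        event = broker.get()
        if event is None:
            break
        received.append(event)

received = []
t_prod = threading.Thread(target=producer, args=(3,))
t_cons = threading.Thread(target=consumer, args=(received,))
t_prod.start(); t_cons.start()
t_prod.join(); t_cons.join()
print([e["event_id"] for e in received])  # [0, 1, 2]
```

Note the contrast with polling: the consumer's `get()` simply blocks until an event arrives, so no calls are wasted.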
Let's take a look at some characteristics of Message Brokers:
When multiple Consumers read events from a Message Broker, delivery can happen in two ways:
1. Load Balancing
In this process the Message Broker aims to parallelize processing by assigning events to consumers arbitrarily; each event is delivered to exactly one consumer.
This is useful when each event is expensive to process, since adding consumers increases throughput.
2. Fan Out
In this process every event in the Message Queue is delivered to all the consumers. Each consumer connects to the Message Broker and processes the full stream independently.
The Load Balancing and Fan-Out processes can be combined to stream to multiple groups of consumers. Let's say there are 2 groups of consumers, each with 3 consumers inside. We can use Fan-Out at the group level, so both consumer groups receive every event, and Load Balancing within each group to parallelize event processing.
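The combined pattern can be sketched as follows (group and consumer names are made up for illustration; within a group, round-robin assignment stands in for the broker's arbitrary choice):

```python
from collections import defaultdict
from itertools import cycle

def deliver(events, groups):
    """Fan out each event to every group; load-balance within a group.

    `groups` maps a group name to its list of consumer names. Every
    group sees every event (fan-out); inside a group, events are
    assigned round-robin, so each event is processed by exactly one
    consumer of that group (load balancing).
    """
    assignments = defaultdict(list)
    pickers = {g: cycle(consumers) for g, consumers in groups.items()}
    for event in events:
        for group, picker in pickers.items():
            assignments[next(picker)].append(event)
    return dict(assignments)

groups = {"alerts": ["a1", "a2"], "archive": ["b1"]}
print(deliver(["E1", "E2", "E3"], groups))
# {'a1': ['E1', 'E3'], 'a2': ['E2'], 'b1': ['E1', 'E2', 'E3']}
```

The "archive" group, with its single consumer, sees the whole stream, while the two "alerts" consumers split it between them.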
Fault Tolerance
Consumers may crash at any point in time. The Message Broker may send an event to a consumer for processing, and the consumer may crash after only partially processing it.
To ensure the message is not lost, message brokers use Acknowledgements. A consumer must explicitly tell the message broker once it has finished processing an Event, so that the broker can permanently remove it from the queue.
If the consumer crashes after partially processing an event, without sending an acknowledgement to the broker, the message broker assumes the event was not processed at all and re-delivers it to a different consumer.
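A minimal sketch of acknowledgement-based redelivery, assuming consumers are modelled as callables that return `True` when they send an ack and `False` when they crash mid-processing (the `FlakyConsumer` class and consumer names are invented for the example):

```python
def process_with_acks(events, consumers):
    """Deliver each event until some consumer acknowledges it.

    `consumers` maps a name to a callable returning True on a
    successful ack, False if the consumer 'crashed' mid-processing.
    An unacknowledged event is redelivered to the next consumer.
    """
    processed = []
    for event in events:
        for name, handle in consumers.items():
            if handle(event):            # consumer sent an ack
                processed.append((name, event))
                break
    return processed

class FlakyConsumer:
    """Crashes (fails to ack) the first time it sees event 'E3'."""
    def __init__(self):
        self.crashed_once = False
    def __call__(self, event):
        if event == "E3" and not self.crashed_once:
            self.crashed_once = True
            return False                 # crashed before sending the ack
        return True

consumers = {"c1": FlakyConsumer(), "c2": lambda e: True}
print(process_with_acks(["E1", "E2", "E3"], consumers))
# [('c1', 'E1'), ('c1', 'E2'), ('c2', 'E3')]
```

Because `c1` never acknowledged E3, the broker hands it to `c2`, so the event is not lost.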
The crash of Consumer 1 mid-processing has an impact on the ordering of events seen by the consumers. Event E3 ends up being consumed after event E4, since Consumer 1 crashed earlier while processing E3 and the broker had to redeliver it.
To avoid reordering of events, we can use a separate queue per consumer.
Conclusion
We discussed the idea of Data Streaming and multiple ways to achieve it. We also looked at the concept of Message Brokers and how they stream data from producers to consumers through a push-based mechanism.
Meanwhile, what you all can do is Like and Share this edition among your peers, and subscribe to this Newsletter so that you get notified when I come up with more content in the future. Share this Newsletter with anyone who might benefit from this content.
Until next time, Dive Deep and Keep Learning!