Data Streaming: Part 1

When I kicked off the streaming project, I thought it would be a breeze with little to learn. But the more I dug in, the more I realized I had underestimated the complexity. So I decided to share some of the insights and lessons I learned along the way.

Complexities

Late Arriving Events

In the real world, streaming events often arrive late. These delays can crop up due to problems on the producer's side, issues with consumers, glitches in the streaming platform, or timeouts. For applications where the order of events is crucial, like in financial scenarios, these delays can be quite impactful. On the flip side, for applications where the sequence of events matters less, approximation logic can be a practical approach.

Therefore, it's essential to take a step back and really think about the kind of use case we're dealing with before jumping into designing any solution. In my situation, the application I was working on was sensitive, and ensuring the correct order of events was crucial.
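
To make this concrete, here's a minimal sketch, my own rather than anything from the project, of one common way to handle late events when ordering matters: buffer incoming events and only release them once a watermark (the latest event time seen, minus an allowed lateness) has passed them. The event shape, the lateness value, and the emit callback are all hypothetical.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class WatermarkBuffer:
    """Buffers out-of-order events and emits them in event-time order once the
    watermark (max event time seen minus an allowed lateness) has passed them."""
    allowed_lateness: float                      # how long we wait for stragglers (seconds)
    emit: Callable[[float, dict], None]          # downstream sink (hypothetical callback)
    _heap: List[Tuple[float, int, dict]] = field(default_factory=list)
    _seq: int = 0                                # tie-breaker for events with equal times
    _max_event_time: float = float("-inf")

    def on_event(self, event_time: float, payload: dict) -> None:
        self._max_event_time = max(self._max_event_time, event_time)
        watermark = self._max_event_time - self.allowed_lateness
        if event_time < watermark:
            # Too late even for the buffer: route to a side channel instead of silently dropping.
            self.emit(event_time, {"late": True, **payload})
            return
        heapq.heappush(self._heap, (event_time, self._seq, payload))
        self._seq += 1
        # Flush everything the watermark has already passed, in event-time order.
        while self._heap and self._heap[0][0] <= watermark:
            t, _, p = heapq.heappop(self._heap)
            self.emit(t, p)

# Events arrive out of order but come out in event-time order.
buf = WatermarkBuffer(allowed_lateness=5.0, emit=lambda t, p: print(t, p))
for t in [10.0, 12.0, 11.0, 20.0, 13.0]:
    buf.on_event(t, {"id": t})
```

Anything that shows up after the watermark has already passed it gets routed to a side channel rather than silently dropped, which ties back to the exception-table idea later in this article.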


At-Least Once versus At-Most Once

  • In some cases, it's okay to deliver a message more than once. The priority here is making sure we don't lose any data, even if that means dealing with the occasional duplicate, while keeping a decent flow of data moving. Here's how it usually goes: the consumer sends an acknowledgment back to the streaming platform, and once the platform receives it, it won't send that event again. But there's always a chance the platform fires the message off a second time before the ack for the first delivery arrives, and that's where duplicates come from.

At-Least Once: needs an ack from the consumer
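
To ground this, here's a hedged sketch of what at-least-once consumption typically looks like on a Kafka-style platform with the confluent-kafka Python client; the broker address, topic name, and process() logic are placeholders of mine, not details from the project. The key move is committing the offset only after processing succeeds, so a crash before the commit means the event gets redelivered, which is exactly where the possible duplicates come from.

```python
from confluent_kafka import Consumer

def process(value: bytes) -> None:
    # Hypothetical business logic; raise on failure so the offset is never committed.
    print("processed:", value)

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # assumed broker address
    "group.id": "orders-consumer",           # assumed consumer group
    "enable.auto.commit": False,             # we decide when to ack
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])               # assumed topic name

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            continue  # a real consumer would log or route transient broker errors
        process(msg.value())
        # Ack (commit the offset) only after processing succeeds.
        # If we crash before this line, the event is redelivered: a possible duplicate.
        consumer.commit(message=msg, asynchronous=False)
finally:
    consumer.close()
```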

  • In other situations, we aim to deliver each message at most once, and we're fine with losing a few along the way. Duplicates are a strict no-no here. The main goal is to prioritize throughput over absolute correctness. Here's how it plays out: the platform sends the message to the consumer once. If the consumer grabs it and processes it successfully, great. If not, that message is gone, and we move on.

At-Most Once: fire-and-forget strategy
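
And the mirror image, again just a sketch under the same assumptions: commit the offset before processing, so the platform never redelivers the event, and a failure during processing simply means the event is lost.

```python
from confluent_kafka import Consumer

def process(value: bytes) -> None:
    # Hypothetical business logic; a failure here means the event is lost for good.
    print("processed:", value)

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # assumed broker address
    "group.id": "metrics-consumer",          # assumed consumer group
    "enable.auto.commit": False,
})
consumer.subscribe(["metrics"])              # assumed topic name

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        # Commit first: the platform will never hand us this event again,
        # so a processing failure below means it is dropped (never duplicated).
        consumer.commit(message=msg, asynchronous=False)
        try:
            process(msg.value())
        except Exception:
            pass  # accept the loss in exchange for simplicity and throughput
finally:
    consumer.close()
```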

Correctness

For our sensitive application, making sure we capture every single record is the heart of the architecture. You know how it goes: records slip through the cracks because the producer never published them, the consumer couldn't keep up, or the streaming platform had issues, leaving some records stranded in an exception category.

Now, enter the Lambda Architecture. The game plan here is simple but effective: two processes on the scene. One taps into real-time streaming data, giving us the live beat of our business, while the other is an ETL batch pipeline that pulls data multiple times a day or when the day wraps up. The extra batch pipeline is what guarantees we've got 100% of the data; the catch is that it makes life a bit more complicated with the whole managing-two-pipelines gig.

Here's the kicker: sometimes the batch pipeline doesn't bring the complete data with payload; it just grabs event IDs for a quality check against the streaming data. Even in those situations, we still lean on the streaming architecture to replay the events sitting in an exception table for the undelivered stuff. So if we run into data quality issues, we have the capability to review and replay those problematic messages.

Lambda Arch: one Streaming Pipeline and another Batch Pipeline.
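
To illustrate the quality-check-and-replay idea, here is a sketch of my own, not the project's actual implementation: a reconciliation step diffs the event IDs the batch pipeline collected against what the streaming pipeline actually landed, then pushes anything missing back through the streaming side from the exception table. The table, publisher, and IDs below are all made up.

```python
from typing import Iterable, Set

def reconcile(batch_event_ids: Iterable[str],
              streamed_event_ids: Iterable[str]) -> Set[str]:
    """Return event IDs the batch pipeline saw but the streaming pipeline never landed."""
    return set(batch_event_ids) - set(streamed_event_ids)

def replay_missing(missing_ids: Set[str], exception_table: dict, publish) -> None:
    """Re-publish undelivered events from the exception table (hypothetical store and publisher)."""
    for event_id in missing_ids:
        event = exception_table.get(event_id)
        if event is None:
            # Not even in the exception table: flag for manual investigation.
            print(f"event {event_id} is missing everywhere and needs investigation")
            continue
        publish(event)  # hand it back to the streaming pipeline for replay

# Toy run: batch saw e1..e4, streaming landed e1, e2 and e4, so e3 gets replayed.
exception_table = {"e3": {"id": "e3", "payload": "..."}}
missing = reconcile(["e1", "e2", "e3", "e4"], ["e1", "e2", "e4"])
replay_missing(missing, exception_table, publish=lambda e: print("replaying", e))
```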

In the next articles, we will discuss more about:

  • Lambda Arch vs Kappa Arch
  • Windowing: Event Time vs Processing Time
  • Batch ETL strategies to consume late-arriving events

