Data Streaming: Part 1
When I kicked off the streaming project, I assumed it would be a breeze with little new to learn. But the deeper I dug, the more I realized I had underestimated the complexity. So I decided to share some of the insights and lessons learned.
Complexities
Late Arriving Events
In the real world, dealing with streaming events often means dealing with delays. These can crop up from problems on the producer's side, issues with consumers, glitches in the streaming platform, or plain timeouts. For applications where the order of events is crucial, like financial scenarios, late arrivals can corrupt results. On the flip side, for applications where the exact sequence doesn't matter much, approximation logic can be a practical approach.
Therefore, it's essential to take a step back and really think about the kind of use case we're dealing with before jumping into designing any solution. In my situation, the application I was working on was sensitive, and ensuring the correct order of events was crucial.
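To make "correct order" concrete, one common pattern is to buffer incoming events in a min-heap keyed by event time and release them only once a watermark (the latest event time seen, minus an allowed-lateness window) has passed them. The sketch below is a minimal, self-contained illustration of that idea; the Event shape and the lateness window are assumptions for the example, not details from the actual project.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Event:
    event_time: int                       # when the event happened (producer clock)
    payload: dict = field(compare=False)  # excluded from ordering

class ReorderBuffer:
    """Holds out-of-order events and emits them in event-time order once
    the watermark (max event time seen - allowed lateness) passes them."""

    def __init__(self, allowed_lateness: int):
        self.allowed_lateness = allowed_lateness
        self.max_event_time = float("-inf")
        self._heap = []  # min-heap ordered by event_time

    def add(self, event: Event) -> None:
        self.max_event_time = max(self.max_event_time, event.event_time)
        heapq.heappush(self._heap, event)

    def drain_ready(self):
        """Yield, in order, every buffered event older than the watermark."""
        watermark = self.max_event_time - self.allowed_lateness
        while self._heap and self._heap[0].event_time <= watermark:
            yield heapq.heappop(self._heap)

# Usage: 101 arrives late but is still emitted in the right position.
buf = ReorderBuffer(allowed_lateness=5)
for t in [100, 103, 101, 110]:
    buf.add(Event(event_time=t, payload={"t": t}))
print([e.event_time for e in buf.drain_ready()])  # -> [100, 101, 103]
```

The trade-off is latency: a larger lateness window tolerates slower producers but delays every downstream result by up to that window.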
At-Least-Once versus At-Most-Once
At-Least-Once: needs an ack from the consumer; unacknowledged messages are redelivered, so nothing is lost but duplicates are possible.
At-Most-Once: a fire-and-forget strategy; no ack, so a failed delivery means the message is lost. (Both modes are sketched below.)
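In practice, the difference often comes down to when the consumer commits its offset relative to doing the work. Here's a minimal sketch using the kafka-python client; the broker address, topic, and group id are placeholders, and process() stands in for real business logic:

```python
from kafka import KafkaConsumer  # assumption: kafka-python, broker on localhost

consumer = KafkaConsumer(
    "payments",                         # hypothetical topic
    bootstrap_servers="localhost:9092",
    group_id="payments-processor",      # hypothetical consumer group
    enable_auto_commit=False,           # commit offsets explicitly
)

def process(record) -> None:
    ...  # business logic goes here

for record in consumer:
    # At-least-once: do the work first, commit after. A crash between the
    # two steps means the record is redelivered: no loss, but duplicates.
    process(record)
    consumer.commit()

    # At-most-once inverts the order: consumer.commit() first, then
    # process(record). A crash right after the commit loses the record:
    # no duplicates, but possible loss.
```

Which mode is right follows directly from the use-case question above: an order-sensitive financial application generally wants at-least-once (or stronger) plus idempotent processing to absorb the duplicates.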
Correctness
For our sensitive application, making sure we capture every single record is the heart of the architecture. You know how it goes: records slip through the cracks because the producer failed to publish, the consumer couldn't keep up, or the streaming platform had issues, leaving some records stranded in an exception category.
Now, enter the Lambda Architecture. The game plan here is simple but effective: two processes run side by side. One taps into real-time streaming data, giving us the live pulse of the business, while the other is an ETL batch pipeline that pulls data several times a day or when the day wraps up. The catch is that the extra batch pipeline gets us to 100% of the data, but it makes life a bit more complicated with the whole managing-two-pipelines gig.
Here's the kicker: sometimes the batch pipeline doesn't carry the complete payload at all. It only grabs event IDs for a quality check against the streaming data. Even then, we still lean on the streaming architecture to replay the events sitting in an exception table for the undelivered stuff. So if we run into data-quality issues, we have the capability to review and replay those problematic messages.
Lambda Architecture: one streaming pipeline plus one batch pipeline.
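To illustrate the ID-level quality check plus the exception-table replay, here's a minimal sketch. Every name in it (the tables, the publish callback) is hypothetical and only stands in for the real components:

```python
# Hypothetical reconciliation between the batch layer's ID snapshot and what
# the streaming pipeline actually delivered. All names are illustrative.

def reconcile_and_replay(batch_event_ids, streamed_event_ids,
                         exception_table, publish):
    """Find events the batch layer saw but streaming never delivered,
    then replay them from the exception table."""
    missing = set(batch_event_ids) - set(streamed_event_ids)
    replayed, unresolved = [], []
    for event_id in sorted(missing):
        event = exception_table.get(event_id)  # stranded payloads parked here
        if event is not None:
            publish(event)                     # re-emit onto the stream
            replayed.append(event_id)
        else:
            unresolved.append(event_id)        # needs manual investigation
    return replayed, unresolved

# Usage with toy data: e3 never made it through streaming.
exception_table = {"e3": {"id": "e3", "amount": 42}}
replayed, unresolved = reconcile_and_replay(
    batch_event_ids={"e1", "e2", "e3"},
    streamed_event_ids={"e1", "e2"},
    exception_table=exception_table,
    publish=lambda event: print("replaying", event["id"]),
)
print(replayed, unresolved)  # -> ['e3'] []
```

The design choice here is that the batch side stays lightweight (IDs only) while the streaming side remains the single source for payloads and replays, which keeps the two pipelines from diverging.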
In the next articles, we will discuss more about: