Data Streaming: Part 1
When I kicked off the streaming project, I assumed it would be a breeze with little new to learn. But the deeper I dug, the more I realized I had underestimated the complexity. So I decided to share some of the insights and lessons learned.
Complexities
Late Arriving Events
In the real world, dealing with streaming events often means dealing with delays. These can crop up from problems on the producer's side, issues with consumers, glitches in the streaming platform, or plain timeouts. For applications where the order of events is crucial, like financial scenarios, late arrivals can corrupt results. On the flip side, for applications where the exact sequence doesn't matter much, approximation logic can be a practical approach.
Therefore, it's essential to take a step back and really think about the kind of use case we're dealing with before jumping into designing any solution. In my situation, the application I was working on was sensitive, and ensuring the correct order of events was crucial.
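To make "correct order" concrete, one common pattern is to buffer incoming events in a min-heap keyed by event time and release them only once a watermark (the latest event time seen, minus an allowed-lateness window) has passed them. The sketch below is a minimal, self-contained illustration of that idea; the Event shape and the lateness window are assumptions for the example, not details from the actual project.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Event:
    event_time: int                       # when the event happened (producer clock)
    payload: dict = field(compare=False)  # excluded from ordering

class ReorderBuffer:
    """Holds out-of-order events and emits them in event-time order once
    the watermark (max event time seen - allowed lateness) passes them."""

    def __init__(self, allowed_lateness: int):
        self.allowed_lateness = allowed_lateness
        self.max_event_time = float("-inf")
        self._heap = []  # min-heap ordered by event_time

    def add(self, event: Event) -> None:
        self.max_event_time = max(self.max_event_time, event.event_time)
        heapq.heappush(self._heap, event)

    def drain_ready(self):
        """Yield, in order, every buffered event older than the watermark."""
        watermark = self.max_event_time - self.allowed_lateness
        while self._heap and self._heap[0].event_time <= watermark:
            yield heapq.heappop(self._heap)

# Usage: 101 arrives late but is still emitted in the right position.
buf = ReorderBuffer(allowed_lateness=5)
for t in [100, 103, 101, 110]:
    buf.add(Event(event_time=t, payload={"t": t}))
print([e.event_time for e in buf.drain_ready()])  # -> [100, 101, 103]
```

The trade-off is latency: a larger lateness window tolerates slower producers but delays every downstream result by up to that window.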
At-Least-Once versus At-Most-Once
At-Least-Once: needs an ack from the consumer; unacknowledged messages are redelivered, so nothing is lost but duplicates are possible.
At-Most-Once: a fire-and-forget strategy; no ack, so a failed delivery means the message is lost. (Both modes are sketched below.)
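In practice, the difference often comes down to when the consumer commits its offset relative to doing the work. Here's a minimal sketch using the kafka-python client; the broker address, topic, and group id are placeholders, and process() stands in for real business logic:

```python
from kafka import KafkaConsumer  # assumption: kafka-python, broker on localhost

consumer = KafkaConsumer(
    "payments",                         # hypothetical topic
    bootstrap_servers="localhost:9092",
    group_id="payments-processor",      # hypothetical consumer group
    enable_auto_commit=False,           # commit offsets explicitly
)

def process(record) -> None:
    ...  # business logic goes here

for record in consumer:
    # At-least-once: do the work first, commit after. A crash between the
    # two steps means the record is redelivered: no loss, but duplicates.
    process(record)
    consumer.commit()

    # At-most-once inverts the order: consumer.commit() first, then
    # process(record). A crash right after the commit loses the record:
    # no duplicates, but possible loss.
```

Which mode is right follows directly from the use-case question above: an order-sensitive financial application generally wants at-least-once (or stronger) plus idempotent processing to absorb the duplicates.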
Correctness
For our sensitive application, making sure we capture every single record is the heart of the architecture. You know how it goes: records slip through the cracks because the producer failed to publish, the consumer couldn't keep up, or the streaming platform had issues, leaving some records stranded in an exception category.
Now, enter the Lambda Architecture. The game plan here is simple but effective: two processes run side by side. One taps into real-time streaming data, giving us the live pulse of the business, while the other is an ETL batch pipeline that pulls data several times a day or when the day wraps up. The catch is that the extra batch pipeline gets us to 100% of the data, but it makes life a bit more complicated with the whole managing-two-pipelines gig.
Here's the kicker: sometimes the batch pipeline doesn't carry the complete payload at all. It only grabs event IDs for a quality check against the streaming data. Even then, we still lean on the streaming architecture to replay the events sitting in an exception table for the undelivered stuff. So if we run into data-quality issues, we have the capability to review and replay those problematic messages.
Lambda Architecture: one streaming pipeline plus one batch pipeline.
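To illustrate the ID-level quality check plus the exception-table replay, here's a minimal sketch. Every name in it (the tables, the publish callback) is hypothetical and only stands in for the real components:

```python
# Hypothetical reconciliation between the batch layer's ID snapshot and what
# the streaming pipeline actually delivered. All names are illustrative.

def reconcile_and_replay(batch_event_ids, streamed_event_ids,
                         exception_table, publish):
    """Find events the batch layer saw but streaming never delivered,
    then replay them from the exception table."""
    missing = set(batch_event_ids) - set(streamed_event_ids)
    replayed, unresolved = [], []
    for event_id in sorted(missing):
        event = exception_table.get(event_id)  # stranded payloads parked here
        if event is not None:
            publish(event)                     # re-emit onto the stream
            replayed.append(event_id)
        else:
            unresolved.append(event_id)        # needs manual investigation
    return replayed, unresolved

# Usage with toy data: e3 never made it through streaming.
exception_table = {"e3": {"id": "e3", "amount": 42}}
replayed, unresolved = reconcile_and_replay(
    batch_event_ids={"e1", "e2", "e3"},
    streamed_event_ids={"e1", "e2"},
    exception_table=exception_table,
    publish=lambda event: print("replaying", event["id"]),
)
print(replayed, unresolved)  # -> ['e3'] []
```

The design choice here is that the batch side stays lightweight (IDs only) while the streaming side remains the single source for payloads and replays, which keeps the two pipelines from diverging.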
In the next articles, we will discuss more about: