Streamlining Your Data: An Overview of Different Types of Streaming Pipelines
Kuldeep Pal
Apache Spark is a powerful big data processing framework that can be used to build streaming data pipelines. I'm new to streaming, so I explored the options available and tried to work out the best way to do it.
Here are a few approaches I found for processing streaming data:
Spark Streaming:
Spark Streaming is a built-in module in Spark that lets you process streaming data using the same API as batch processing. It ingests data streams, divides them into small batches, and processes them with Spark's core engine through a high-level, micro-batch-based API.
Structured Streaming:
This is a newer streaming API in Spark that provides a high-level, declarative API for processing structured data streams. It lets you express your streaming computation as a standard SQL query and automatically handles the details of running that computation incrementally in a distributed environment.
Still confused between the two?
Here is a reference for digging deeper into Spark and Kafka:
Flink Streaming:
Apache Flink is a stream processing framework that supports both batch and streaming workloads. It lets you express your streaming computation as a dataflow and automatically handles the details of running that computation in a distributed environment. Its high-level, dataflow-based API makes it suitable for use cases that require low latency and stateful computation.
Spark Streaming: In Spark Streaming, each batch of data is processed as an RDD (Resilient Distributed Dataset), a fundamental data structure in Spark. The processed batches are then aggregated to produce the final result.
Spark Streaming supports a variety of data sources, such as Kafka, Flume, Kinesis, and others, and it can process both structured and unstructured data. It also provides a high-level, micro-batch-based API, which allows you to express your streaming computation using operations similar to those used in batch processing, such as map, reduce, filter, and window.
Spark Streaming provides fault tolerance by tracking the lineage information of each RDD, which allows it to recover from node failures. However, its state-management primitives are limited, and it has no built-in support for event-time processing.
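To make the micro-batch model concrete, here is a minimal sketch of a DStream word count in PySpark. The socket source on localhost:9999 and the 5-second batch interval are illustrative assumptions, not anything Spark prescribes:

```python
# A minimal sketch of a Spark Streaming (DStream) word count, assuming
# a text source on a local socket; both the address and the batch
# interval are illustrative choices.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "DStreamWordCount")
ssc = StreamingContext(sc, 5)  # micro-batches every 5 seconds

# Each micro-batch of lines arrives as an RDD under the hood.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()  # print the counts computed for each micro-batch

ssc.start()
ssc.awaitTermination()
```

Each 5-second micro-batch materializes as an RDD, so familiar batch operations like map and reduceByKey apply unchanged.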
Use cases:
Spark Streaming is suitable for use cases that can tolerate latencies on the order of the batch interval, typically seconds. It's well suited for use cases like real-time analytics, fraud detection, and IoT data processing.
Structured Streaming: It allows you to express your streaming computation as a standard SQL query, and it automatically handles the details of running the computation in a distributed environment. Structured Streaming supports a variety of data sources, such as Kafka, Kinesis, files, and others, and it can process both structured and semi-structured data.
Structured Streaming provides automatic checkpointing and state management, which allows it to recover from failures and handle late data. It also provides built-in support for event-time processing, which allows you to process data based on the timestamps of the events.
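As an illustration, here is a minimal Structured Streaming sketch that combines a watermark with an event-time window. The Kafka broker address, topic name, and the window and watermark durations are hypothetical, and the Kafka source also requires the spark-sql-kafka connector package on the classpath:

```python
# A minimal sketch of event-time processing in Structured Streaming,
# assuming a Kafka broker at localhost:9092 and a topic named "events"
# (both hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("StructuredStreamingDemo").getOrCreate()

events = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "localhost:9092")
               .option("subscribe", "events")
               .load())

# Use the Kafka message timestamp as event time: tolerate 10 minutes
# of late data via a watermark, then count events in 5-minute windows.
windowed_counts = (events
                   .withWatermark("timestamp", "10 minutes")
                   .groupBy(window(col("timestamp"), "5 minutes"))
                   .count())

query = (windowed_counts.writeStream
                        .outputMode("update")
                        .format("console")
                        .start())
query.awaitTermination()
```

The watermark tells Spark how long to keep a window's state open for late events, which is the late-data handling described above; durable recovery is enabled by adding a checkpointLocation option to the writer.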
Use cases:
Structured Streaming is suitable for use cases that require low latencies and that can handle late data. It's well suited for use cases like real-time analytics, fraud detection, and IoT data processing.
In summary, Spark Streaming provides a micro-batch-based API suited to moderate latencies, while Structured Streaming provides a declarative, SQL-like API that achieves lower latencies and handles late data, with built-in support for event-time processing and state management.
Spark’s Structured Streaming model is built on the Spark SQL engine rather than on the older DStream construct, so users no longer need to access RDD blocks directly. It operates on DataFrames, which brings lower latency, greater throughput, and stronger delivery guarantees (exactly-once with supported sources and sinks).
Apache Flink: It is a powerful stream processing framework that can be used to build streaming data pipelines. It allows you to express your streaming computation as a data flow, and it automatically handles the details of running the computation in a distributed environment.
Flink Streaming supports a variety of data sources, such as Kafka, Kinesis, files, and others, and it can process both structured and unstructured data. It also provides a high-level, dataflow-based API, which allows you to express your streaming computation using operations similar to those used in batch processing, such as map, reduce, filter, and window.
Flink Streaming provides automatic checkpointing and state management, which allows it to recover from failures and handle late data. It has built-in support for event-time processing, so you can process data based on the timestamps of the events, and it supports stateful computation, letting you maintain and update state across events.
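As a small illustration, here is a minimal PyFlink DataStream sketch of a stateful running count. The in-memory collection and its (word, count) tuples are hypothetical stand-ins for a real source such as Kafka:

```python
# A minimal sketch of stateful stream processing with PyFlink's
# DataStream API; requires the apache-flink package
# (pip install apache-flink).
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Hypothetical (word, count) events standing in for a live stream.
events = env.from_collection([("spark", 1), ("flink", 1), ("spark", 1)])

# key_by partitions the stream by word; reduce keeps a running total
# per key in Flink-managed state and emits an update for every event.
totals = (events
          .key_by(lambda e: e[0])
          .reduce(lambda a, b: (a[0], a[1] + b[1])))

totals.print()
env.execute("flink_running_counts")
```

Because the reduce runs on a keyed stream, Flink keeps the per-key total in managed state, which is what its checkpointing mechanism snapshots for fault tolerance.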
Use cases:
Flink Streaming is suitable for use cases that require low latencies and stateful computation. It's well suited for real-time analytics, fraud detection, and IoT data processing, as well as windowed aggregation, stateful stream processing, and event-time processing.
In summary, Flink Streaming provides a dataflow-based API that can handle low latencies and stateful computation, with built-in support for event-time processing and state management.
Reference: Flink vs Spark
Thank you for reading our newsletter blog on streaming. I hope that this information was helpful and will help you keep your data streams running smoothly. If you found this blog useful, please share it with your colleagues and friends. And don't forget to subscribe to our newsletter to receive updates on the latest developments in data streaming and other related topics. Until next time, keep streaming!