Streamlining Your Data: An Overview of Different Types of Streaming Pipelines
Kuldeep Pal
Apache Spark is a powerful big data processing framework that can be used to build streaming data pipelines. I'm new to streaming, so I explored the options available and tried to work out the best way to do it.
Here are a few approaches I found for processing streaming data:
Spark Streaming:
Spark Streaming is a built-in module in Spark that lets you process streaming data using the same API as batch processing. It ingests data streams, divides them into small batches, and processes them with Spark's core engine through a high-level, micro-batch-based API.
Structured Streaming:
This is a newer streaming API in Spark that provides a high-level, declarative API for processing structured data streams. It lets you express your streaming computation as a standard SQL query and automatically handles the details of running that computation incrementally in a distributed environment.
Still confused between the two?
Here is a reference for digging deeper into Spark and Kafka:
Flink Streaming:
Apache Flink is a stream processing framework that supports both batch and streaming workloads. It lets you express your streaming computation as a dataflow and automatically handles the details of running that computation in a distributed environment. Its high-level, dataflow-based API makes it suitable for use cases that require low latency and stateful computation.
Spark Streaming: In Spark Streaming, each batch of data is processed as an RDD (Resilient Distributed Dataset), a fundamental data structure in Spark. The processed batches are then aggregated to produce the final result.
Spark Streaming supports a variety of data sources, such as Kafka, Flume, Kinesis, and others, and it can process both structured and unstructured data. It also provides a high-level, micro-batch-based API, which allows you to express your streaming computation using operations similar to those used in batch processing, such as map, reduce, filter, and window.
Spark Streaming provides fault tolerance by tracking the lineage information of each RDD, which allows it to recover from node failures. However, its state-management primitives are limited, and it has no built-in support for event-time processing.
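To make the micro-batch model concrete, here is a minimal sketch of a DStream word count in PySpark. The socket source on localhost:9999 and the 5-second batch interval are illustrative assumptions, not anything Spark prescribes:

```python
# A minimal sketch of a Spark Streaming (DStream) word count, assuming
# a text source on a local socket; both the address and the batch
# interval are illustrative choices.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "DStreamWordCount")
ssc = StreamingContext(sc, 5)  # micro-batches every 5 seconds

# Each micro-batch of lines arrives as an RDD under the hood.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()  # print the counts computed for each micro-batch

ssc.start()
ssc.awaitTermination()
```

Each 5-second micro-batch materializes as an RDD, so familiar batch operations like map and reduceByKey apply unchanged.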
Use cases:
Spark Streaming is suitable for use cases that can tolerate latencies on the order of the batch interval, typically seconds. It's well suited for use cases like real-time analytics, fraud detection, and IoT data processing.
Structured Streaming: It allows you to express your streaming computation as a standard SQL query, and it automatically handles the details of running the computation in a distributed environment. Structured Streaming supports a variety of data sources, such as Kafka, Kinesis, files, and others, and it can process both structured and semi-structured data.
Structured Streaming provides automatic checkpointing and state management, which allows it to recover from failures and handle late data. It also provides built-in support for event-time processing, which allows you to process data based on the timestamps of the events.
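As an illustration, here is a minimal Structured Streaming sketch that combines a watermark with an event-time window. The Kafka broker address, topic name, and the window and watermark durations are hypothetical, and the Kafka source also requires the spark-sql-kafka connector package on the classpath:

```python
# A minimal sketch of event-time processing in Structured Streaming,
# assuming a Kafka broker at localhost:9092 and a topic named "events"
# (both hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("StructuredStreamingDemo").getOrCreate()

events = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "localhost:9092")
               .option("subscribe", "events")
               .load())

# Use the Kafka message timestamp as event time: tolerate 10 minutes
# of late data via a watermark, then count events in 5-minute windows.
windowed_counts = (events
                   .withWatermark("timestamp", "10 minutes")
                   .groupBy(window(col("timestamp"), "5 minutes"))
                   .count())

query = (windowed_counts.writeStream
                        .outputMode("update")
                        .format("console")
                        .start())
query.awaitTermination()
```

The watermark tells Spark how long to keep a window's state open for late events, which is the late-data handling described above; durable recovery is enabled by adding a checkpointLocation option to the writer.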
Use cases:
Structured Streaming is suitable for use cases that require low latencies and that can handle late data. It's well suited for use cases like real-time analytics, fraud detection, and IoT data processing.
In summary, Spark Streaming provides a micro-batch-based API suited to moderate latencies, while Structured Streaming provides a declarative, SQL-like API that achieves lower latencies and handles late data, with built-in support for event-time processing and state management.
Spark’s Structured Streaming model is built on the Spark SQL engine rather than on the older DStream construct, so users no longer need to access RDD blocks directly. It operates on DataFrames, which brings lower latency, greater throughput, and stronger delivery guarantees (exactly-once with supported sources and sinks).
Apache Flink: It is a powerful stream processing framework that can be used to build streaming data pipelines. It allows you to express your streaming computation as a data flow, and it automatically handles the details of running the computation in a distributed environment.
Flink Streaming supports a variety of data sources, such as Kafka, Kinesis, files, and others, and it can process both structured and unstructured data. It also provides a high-level, dataflow-based API, which allows you to express your streaming computation using operations similar to those used in batch processing, such as map, reduce, filter, and window.
Flink Streaming provides automatic checkpointing and state management, which allows it to recover from failures and handle late data. It has built-in support for event-time processing, so you can process data based on the timestamps of the events, and it supports stateful computation, letting you maintain and update state across events.
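As a small illustration, here is a minimal PyFlink DataStream sketch of a stateful running count. The in-memory collection and its (word, count) tuples are hypothetical stand-ins for a real source such as Kafka:

```python
# A minimal sketch of stateful stream processing with PyFlink's
# DataStream API; requires the apache-flink package
# (pip install apache-flink).
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Hypothetical (word, count) events standing in for a live stream.
events = env.from_collection([("spark", 1), ("flink", 1), ("spark", 1)])

# key_by partitions the stream by word; reduce keeps a running total
# per key in Flink-managed state and emits an update for every event.
totals = (events
          .key_by(lambda e: e[0])
          .reduce(lambda a, b: (a[0], a[1] + b[1])))

totals.print()
env.execute("flink_running_counts")
```

Because the reduce runs on a keyed stream, Flink keeps the per-key total in managed state, which is what its checkpointing mechanism snapshots for fault tolerance.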
Use cases:
Flink Streaming is suitable for use cases that require low latencies and stateful computation. It's well suited for real-time analytics, fraud detection, and IoT data processing, as well as windowed aggregation, stateful stream processing, and event-time processing.
In summary, Flink Streaming provides a dataflow-based API that can handle low latencies and stateful computation, with built-in support for event-time processing and state management.
Reference: Flink vs Spark
Thank you for reading our newsletter blog on streaming. I hope that this information was helpful and will help you keep your data streams running smoothly. If you found this blog useful, please share it with your colleagues and friends. And don't forget to subscribe to our newsletter to receive updates on the latest developments in data streaming and other related topics. Until next time, keep streaming!