Stream Processing Using Apache Flink
Getting Started
Introduction
In our previous discussion, we highlighted the importance of data streaming for near-real-time data applications and its transformative impact on businesses. Whether it's enhancing analytics, facilitating fluid interaction with large language models, or shortening time to insight, the demand for streaming pipelines is growing and is expected to continue rising. However, the architecture we discussed there lacked adequate data transformation capabilities as messages flow through the streaming pipeline: transformations were deferred until the data reached the data warehouse, where it was subsequently processed in batches.
Processing vast amounts of data quickly, reliably, and at scale presents several challenges. From a data consumer's point of view, batch processing treats the query engine as an external observer for which the entire dataset is available at once. In contrast, stream processing places the observer directly inside the stream, where only chunks of information can be processed as they flow through the pipeline. This paradigm shift increases the complexity of data processing.
There are several ways to transform an incoming stream, one of which involves enriching the messages with data from an external database. This database could be a transactional database, object storage, or even a NoSQL database. Depending on the required throughput, caching capabilities can also be incorporated into the external database to improve performance. For instance, in an IoT application, device information traveling through the pipeline can be enriched by using the device ID to fetch additional details from the database. The enriched message is then sent downstream, enhancing the value of the data pipeline. This approach not only reduces costs in the analytics layer, where the data is ultimately stored, but also increases the overall efficiency and effectiveness of the data processing workflow.
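To make the enrichment idea concrete, here is a minimal, framework-agnostic Python sketch. The device catalog, field names, and lookup logic are illustrative assumptions, not part of the pipeline we build later in this post:

```python
import json

# Stand-in for an external lookup store; a real pipeline might query a
# transactional database, object storage, or a NoSQL store, optionally
# fronted by a cache when higher throughput is required.
DEVICE_CATALOG = {
    "sensor-1": {"location": "plant-a", "model": "TH-200"},
    "sensor-2": {"location": "plant-b", "model": "TH-300"},
}

def enrich(raw_message: str) -> str:
    """Attach device metadata to an incoming JSON message, keyed by device ID."""
    event = json.loads(raw_message)
    details = DEVICE_CATALOG.get(event.get("device_id"), {})
    event.update(details)        # merge the extra device details into the event
    return json.dumps(event)     # pass the enriched record downstream
```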
Data can also be combined from different streams and merged into a single pipeline. This requires a certain degree of synchronicity between the streams; otherwise, it won't be possible to pack them together. To achieve this, various windowing techniques are used to wait for a specific amount of time before generating the output. However, this type of combination can degrade overall performance, as latency is affected by the waiting mechanisms inherent to windowing. Throughput might also be impacted due to the computational cost of merging the streams. Despite these challenges, this approach enriches the real-time information flowing through the pipeline. For example, merging data from user preferences with advertising content can significantly improve the customer experience.
As briefly mentioned, the nature of stream processing requires specific techniques, such as windows and watermarks, to handle the unique characteristics of streaming data. While the details of these techniques are beyond the scope of this blog entry, they are crucial for managing data flow and ensuring accurate processing. We will cover these methods in more depth in future posts.
The (Almost) Definitive Solution: Apache Flink
Apache Flink is an open-source stream processing framework that is gaining interest and adoption as a key component of the open-source data stack. Among the features that contribute to its widespread adoption, we highlight the following:
- Powerful Runtime: Flink provides exceptional resource efficiency, high throughput with low latency, robust state management, fault tolerance, and horizontal scalability, making it suitable for demanding applications.
- Comprehensive API and Language Support: Flink supports multiple languages and offers a rich API, making it accessible and flexible for developers.
- Unified Processing: Flink allows for both stream and batch processing using the same tools, simplifying the architecture and development process.
- Production-Readiness: Flink is designed with observability and stability in mind, ensuring it can handle production workloads reliably.
Key Use Cases and Success Stories
Data streaming is particularly essential in scenarios where real-time data processing is critical. For instance, financial services use Flink for fraud detection, real-time risk management, and algorithmic trading. Retail companies leverage Flink to personalize customer experiences and optimize inventory management. Streaming applications in manufacturing improve operational efficiency by monitoring equipment health and predicting maintenance needs.
Success stories include:
- Uber: Uses Flink to process real-time data for dynamic pricing and trip optimization.
- Netflix: Utilizes Flink for real-time recommendation systems and monitoring streaming quality.
- ING: Employs Flink for real-time fraud detection and transaction monitoring.
Apache Flink’s APIs and language support
Apache Flink offers four distinct APIs, each designed for different users and use cases, providing a versatile range of programming language support including Python, Java, and SQL. The layered APIs in Flink cater to various levels of abstraction, enabling flexibility for both common and specialized tasks. The DataStream API, available in Java and Python, allows for creating dataflow graphs through transformation functions like flatMap, filter, and process, granting fine-grained control over how records flow within the applications. This API will feel familiar to users of the Kafka Streams DSL and Kafka Processor API due to its similar functionalities.
The Table API, Flink’s next-generation declarative API, facilitates building programs using relational operations such as joins, filters, aggregations, and projections, and is optimized like Flink SQL queries. Available in Java and Python, it shares many features with SQL including type systems and built-in functions. Flink SQL, which is ANSI standard compliant, processes both real-time and historical data using Apache Calcite for query optimization. It supports complex queries and has a broad ecosystem. Additionally, Flink’s "Stateful Functions" sub-project simplifies the development of stateful, distributed event-driven applications, functioning as a stateful, fault-tolerant, distributed Actor system. This breadth of API options ensures that Apache Flink can meet diverse stream processing needs, allowing for seamless integration and evolution of services over time.
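As a small taste of the declarative side, here is a hedged sketch of running Flink SQL through the Python Table API. The table definition uses the built-in datagen connector purely for illustration; the column names and options are assumptions, not part of the example built later in this post:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Create a streaming Table API environment.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Define a source table with the built-in 'datagen' connector, which
# produces random rows and is convenient for quick experiments.
t_env.execute_sql("""
    CREATE TABLE readings (
        device_id STRING,
        temperature DOUBLE
    ) WITH (
        'connector' = 'datagen',
        'rows-per-second' = '1'
    )
""")

# Relational operations (projection + filter) expressed as plain SQL.
result = t_env.sql_query(
    "SELECT device_id, temperature FROM readings WHERE temperature > 30"
)
result.execute().print()
```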
Connectors
Apache Flink offers a variety of connectors that integrate seamlessly with different data storage systems, enhancing interoperability between services, which is crucial in an open-source data stack environment. These connectors allow Flink to ingest, process, and output data from/to various systems, ensuring smooth data flow across different platforms.
- Apache Kafka (source/sink)
- Apache Cassandra (source/sink)
- Amazon DynamoDB (sink)
- Amazon Kinesis Data Streams (source/sink)
- Amazon Kinesis Data Firehose (sink)
- DataGen (source)
- Elasticsearch (sink)
- Opensearch (sink)
- FileSystem (source/sink)
- RabbitMQ (source/sink)
- Google PubSub (source/sink)
- Hybrid Source (source)
- Apache Pulsar (source)
- Apache Iceberg (sink)
- JDBC (sink)
- MongoDB (source/sink)
These connectors significantly expand Flink’s capability to integrate with various ecosystems, allowing for versatile data processing pipelines that can handle diverse data sources and destinations. This flexibility makes Flink an indispensable tool for building robust, real-time data processing applications across different industries and use cases.
Comparison with Apache Spark Structured Streaming
While both Apache Flink and Apache Spark Structured Streaming are powerful stream processing frameworks, there are key differences:
- True Stream Processing vs. Micro-batching: Flink is designed for true stream processing, handling data on an event-by-event basis. In contrast, Spark Structured Streaming uses micro-batching, processing small batches of data at short intervals. This can lead to higher latency compared to Flink.
- State Management: Flink's advanced state management capabilities and support for exactly-once semantics make it particularly suited for complex event processing and stateful applications.
- API and Integration: Flink offers a more extensive set of APIs and integrations for complex stream processing tasks, while Spark is often favored for its unified engine and ease of transitioning from batch to streaming.
As can be seen in the micro-batch model figure from the Structured Streaming documentation, the Spark job runs at fixed intervals, in this case every second, and appends newly arrived data from the stream to an unbounded result table. If you want to dive deeper, the documentation describes additional output modes beyond append.
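For comparison, here is a minimal PySpark Structured Streaming sketch that makes the micro-batching visible. The socket source is just a convenient stand-in for a real stream, and the one-second trigger is an assumption chosen to match the example above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("micro-batch-demo").getOrCreate()

# Treat a local socket (e.g. started with `nc -lk 9999`) as an unbounded input table.
lines = (
    spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)

# On every trigger (roughly every second here), Spark processes the newly arrived
# rows as a small batch and appends them to the unbounded result table.
query = (
    lines.writeStream
    .outputMode("append")
    .format("console")
    .trigger(processingTime="1 second")
    .start()
)

query.awaitTermination()
```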
Implementation
Streaming Architecture to Implement
Alright! Enough with the theory, let’s jump into action. The following diagram shows the architecture that we’re going to build on our local machine. The main idea is to have a complete example of a real-world scenario where a stream processor can be used for something useful. In this case, we’re going to use it to filter alerts coming from IoT sensor data.
The first block constitutes the actual data publisher, the source of information. It simulates data potentially coming from different IoT devices and carries three measurements: temperature, pressure, and vibration. You can configure the rate at which the information is published to the topic by defining a SLEEP_TIME constant between messages in the producer script. Data is then stored in a Kafka topic called sensors. Finally, a Flink instance subscribes to the topic to read the data and filter the messages from the sensors that report a temperature higher than 30. You can configure this by setting the TEMP_THRESHOLD value. Messages will be displayed in the standard output (console) for now, but eventually this stream will be sunk into another storage system.
The idea is to keep things simple, experimenting with Kafka and Flink in our local environment before moving to cloud infrastructure. We think adding cloud complexity for a first interaction with these services is unnecessary. What we’re going to create is simple enough to run on a local machine, so we can build and debug with ease.
Requirements
There’s a guide in the repository README.md where all the installation steps are covered in detail. You don’t have to follow all the instructions, but we recommend installing pyenv and poetry, which are better alternatives to pip and conda without the conda packaging overhead (even using miniconda). Furthermore, poetry can be easily integrated into a CI/CD pipeline. You can skip this and simply install the requirements from requirements.txt, so it’s up to you. We have often struggled with package dependencies that poetry can resolve natively. If you haven’t tried it yet, we think this is a good opportunity. Let us know if you like it or if you struggle with installing pyenv in the comments.
Summary of Local Setup
Here’s a summary of what needs to be done locally:
- Download and install the prerequisites listed in the repository README
- Clone the repo
- Create a Python 3.10 development environment
- Install the requirements (using pip or poetry) so we can run Kafka and Flink locally.
Python Script - Producer
We’ve briefly talked about the producer; it simply simulates several sensors publishing data to the topic. When developing a streaming pipeline, one of the most important things is to design the message structure, also known as a contract, which defines how the stream processor will read the data. An IoT device can add more measurements over time, meaning more sensors and more data to exchange, so it’s common to use JSON objects as messages, which have the flexibility to change over time. Provided that you don’t alter the structure significantly, you’ll maintain access to the relevant information.
We believe it’s also important to add some parameters to the exchanged message to facilitate its traceability. At least two fields should be included to identify the uniqueness and generation time of each message. This way, it’s easier to sort them and remove duplicates, whether the processing takes place during the streaming pipeline or in a later data manipulation stage, such as in a data warehouse/lake.
Considering the above, this is the message exchanged:
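Since the exact payload lives in the repository script, here is a hedged sketch assuming the kafka-python client; the field names and value ranges are illustrative. It shows a sample message carrying the uniqueness and generation-time fields alongside the three sensor readings, plus the loop that publishes it to the sensors topic at the SLEEP_TIME rate:

```python
import json
import random
import time
import uuid
from datetime import datetime, timezone

from kafka import KafkaProducer  # kafka-python client (assumed here)

SLEEP_TIME = 1  # seconds between messages; controls the publishing rate

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    message = {
        "message_id": str(uuid.uuid4()),                      # uniqueness
        "timestamp": datetime.now(timezone.utc).isoformat(),  # generation time
        "device_id": f"sensor-{random.randint(1, 5)}",
        "temperature": round(random.uniform(20.0, 40.0), 2),
        "pressure": round(random.uniform(0.9, 1.1), 2),
        "vibration": round(random.uniform(0.0, 1.0), 2),
    }
    producer.send("sensors", value=message)
    time.sleep(SLEEP_TIME)
```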
Flink Streaming Processor - Consumer
Regarding the consumer, our stream processor is implemented using PyFlink, the Python flavor of Apache Flink. To use the framework correctly, several steps need to be followed; a general program flow involves the following:
- Obtain an execution environment,
- Load the initial data or connect to a stream source,
- Specify transformations on this data,
- Specify where to put the results of your computations,
- Trigger the program execution
First, an environment has to be initialized. In that environment, we declare the source used and import the needed connector packages. In this case, the package enables us to connect with Kafka.
Bear in mind that this particular connector depends on the Flink version used. For this reason, it’s important to refer to the official documentation and download the correct .jar package. At the time of writing this article, the connector for Apache Flink 1.19 was not yet available, forcing us to use the previous version 1.18. The README has more details regarding the link used to download it.
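A minimal sketch of this step, assuming the connector JAR has already been downloaded to a local path (the exact file name depends on the connector and Flink versions, as described in the README):

```python
from pyflink.datastream import StreamExecutionEnvironment

# Create the streaming execution environment.
env = StreamExecutionEnvironment.get_execution_environment()

# Register the Kafka connector JAR downloaded beforehand. The path and file
# name below are placeholders; match them to the version noted in the README.
env.add_jars("file:///path/to/flink-sql-connector-kafka-1.18.jar")
```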
It's also important to mention that a legacy library exists to connect with Kafka. We encourage you to use the official package available in the documentation, which uses the KafkaSource class. The documentation describes the parameters configured in the KafkaSource to build our kafka_source object correctly.
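A hedged sketch of the KafkaSource configuration follows; the broker address, consumer group, and starting offsets are illustrative local-setup choices:

```python
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream.connectors.kafka import KafkaSource, KafkaOffsetsInitializer

kafka_source = (
    KafkaSource.builder()
    .set_bootstrap_servers("localhost:9092")        # local Kafka broker
    .set_topics("sensors")                          # topic written by the producer
    .set_group_id("iot-alert-consumer")             # consumer group (illustrative)
    .set_starting_offsets(KafkaOffsetsInitializer.earliest())
    .set_value_only_deserializer(SimpleStringSchema())  # read values as plain strings
    .build()
)
```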
With the environment created and properly configured, our kafka_source can be added and a DataStream object instantiated to read the incoming stream and transform it. For our source, we’re not declaring watermark strategies; data will be processed as it arrives in the simplest way.
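Attaching the source to the environment without a watermark strategy looks roughly like this (the source name is arbitrary):

```python
from pyflink.common import WatermarkStrategy

# Create a DataStream from the Kafka source; with no watermarks, records are
# simply processed as they arrive.
ds = env.from_source(
    kafka_source,
    WatermarkStrategy.no_watermarks(),
    "Kafka Source",
)
```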
In the core of the script, transformations are done using the map function from the DataStream API. Each streaming message is parsed and manipulated, and only the ones that produce an alert string are passed downstream.
The parse_and_filter function is declared as follows. The filtering process is relatively simple, and the result is a string with a message declared as alert_message. Notice that we’re expecting a string as the function input because we’ve configured it as a SimpleStringSchema to deserialize Kafka messages.
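A hedged version of that function and how it plugs into the stream is sketched below; the alert text, the accessed field names, and the empty-string convention for non-alerts are assumptions matching the description above:

```python
import json

from pyflink.common import Types

TEMP_THRESHOLD = 30  # alert threshold for the temperature reading

def parse_and_filter(raw_message: str) -> str:
    """Parse the JSON payload and emit an alert string when the threshold is exceeded."""
    event = json.loads(raw_message)
    if event.get("temperature", 0) > TEMP_THRESHOLD:
        alert_message = (
            f"ALERT: device {event.get('device_id')} reported "
            f"temperature {event.get('temperature')}"
        )
        return alert_message
    return ""  # non-alerts become empty strings and are dropped below

alerts = (
    ds.map(parse_and_filter, output_type=Types.STRING())
    .filter(lambda value: value != "")  # keep only actual alert messages
)
alerts.print()  # write alerts to standard output for now
```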
The final step in every Flink script is to actually execute the environment. This is achieved with the execute method.
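For instance (the job name is arbitrary):

```python
# Nothing runs until the environment is executed; this call submits the dataflow.
env.execute("iot-sensor-alerts")
```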
This setup provides a simple yet effective demonstration of how to use Kafka and Flink for processing IoT sensor data on a local machine.
Next Steps
In the next part of this series, we'll take our streaming architecture to the next level by moving beyond the local environment. Here’s a sneak peek at what we’ll cover:
1. Use Containers to Run Services in a Provisioned Cluster
Running our services on a local machine is great for development and testing, but to simulate real-world conditions more closely, we'll move to a containerized environment. We’ll use Docker to containerize our Kafka and Flink services and deploy them in a provisioned cluster. This setup will mimic cloud infrastructure, allowing us to test our streaming pipeline under conditions similar to those in a production environment. We’ll explore how to set up Docker Compose files to manage multi-container Docker applications, ensuring all services are correctly networked and configured.
2. Manage Flink Jobs through the GUI Interface
Once we have our cluster up and running, managing Flink jobs efficiently becomes crucial. We'll dive into the Flink Dashboard, a powerful GUI interface that simplifies job management. Through the dashboard, we’ll show you how to submit, monitor, and manage Flink jobs. This visual tool will help us gain insights into job performance, identify bottlenecks, and troubleshoot issues more effectively.
3. Transform Data Using the SQL API
Finally, we’ll explore the power of the SQL API in Flink. This feature allows us to write SQL queries to perform complex transformations on our data streams. We’ll demonstrate how to use the SQL API to filter, aggregate, and join data streams, making our data processing pipeline more robust and versatile. This approach leverages the familiarity and expressiveness of SQL, making stream processing more accessible and powerful.
Closing Notes
As we wrap up this second part of our journey into building a streaming architecture, it's important to highlight a few key takeaways:
Simplicity First
When designing a streaming data pipeline, simplicity should always be a priority. If your transformations are straightforward, such as basic filtering or aggregation, consider whether these tasks can be performed directly on Kafka using Kafka Streams or KSQL. These tools allow for powerful stream processing capabilities directly within Kafka, reducing the need for additional components. This can simplify your architecture and reduce operational complexity.
Synergy of Components
The synergy between Kafka and Flink is a powerful aspect of our setup. Kafka serves as a robust storage and messaging system, efficiently handling high-throughput data streams. Flink complements this by providing powerful compute capabilities for complex data transformations and real-time analytics. Together, they form a resilient and scalable platform for processing IoT sensor data and many other streaming applications.
Flink Adoptability
Apache Flink's flexibility and comprehensive feature set make it an excellent choice for stream processing. Its support for various APIs, including DataStream and SQL, makes it accessible to both developers and data analysts. As you become more familiar with Flink, you'll find that it can handle a wide range of use cases, from simple event-driven applications to complex data analytics workflows.
By focusing on simplicity, leveraging the strengths of Kafka and Flink, and embracing Flink’s robust capabilities, you can build a highly effective and scalable streaming data pipeline.
We look forward to exploring these topics further and helping you take your streaming architecture to new heights. Stay tuned for the next part of our series!