Stream Processing Using Apache Flink
Getting Started
Introduction
In our previous discussion, we highlighted the importance of data streaming for near-real-time data applications and its transformative impact on businesses. Whether it's enhancing analytics, facilitating fluid interaction with large language models, or shortening time to insight, the demand for streaming pipelines is growing and is expected to continue rising. However, the architecture we discussed there lacked adequate data transformation capabilities as messages flow through the streaming pipeline: transformations were deferred until the data reached the data warehouse, where it was subsequently processed in batches.
Processing vast amounts of data quickly, reliably, and at scale presents several challenges. From a data consumer's point of view, batch processing treats the query engine as an external observer for which the entire dataset is available at once. In contrast, stream processing places the observer directly inside the stream, where only chunks of information can be processed as they flow through the pipeline. This paradigm shift increases the complexity of data processing.
There are several ways to transform an incoming stream, one of which involves enriching the messages with data from an external database. This database could be a transactional database, object storage, or even a NoSQL database. Depending on the required throughput, caching capabilities can also be incorporated into the external database to improve performance. For instance, in an IoT application, device information traveling through the pipeline can be enriched by using the device ID to fetch additional details from the database. The enriched message is then sent downstream, enhancing the value of the data pipeline. This approach not only reduces costs in the analytics layer, where the data is ultimately stored, but also increases the overall efficiency and effectiveness of the data processing workflow.
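To make the enrichment idea concrete, here is a minimal, framework-agnostic Python sketch. The device catalog, field names, and lookup logic are illustrative assumptions, not part of the pipeline we build later in this post:

```python
import json

# Stand-in for an external lookup store; a real pipeline might query a
# transactional database, object storage, or a NoSQL store, optionally
# fronted by a cache when higher throughput is required.
DEVICE_CATALOG = {
    "sensor-1": {"location": "plant-a", "model": "TH-200"},
    "sensor-2": {"location": "plant-b", "model": "TH-300"},
}

def enrich(raw_message: str) -> str:
    """Attach device metadata to an incoming JSON message, keyed by device ID."""
    event = json.loads(raw_message)
    details = DEVICE_CATALOG.get(event.get("device_id"), {})
    event.update(details)        # merge the extra device details into the event
    return json.dumps(event)     # pass the enriched record downstream
```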
Data can also be combined from different streams and merged into a single pipeline. This requires a certain degree of synchronicity between the streams; otherwise, it won't be possible to pack them together. To achieve this, various windowing techniques are used to wait for a specific amount of time before generating the output. However, this type of combination can degrade overall performance, as latency is affected by the waiting mechanisms inherent to windowing. Throughput might also be impacted due to the computational cost of merging the streams. Despite these challenges, this approach enriches the real-time information flowing through the pipeline. For example, merging data from user preferences with advertising content can significantly improve the customer experience.
As briefly mentioned, the nature of stream processing requires specific techniques, such as windows and watermarks, to handle the unique characteristics of streaming data. While the details of these techniques are beyond the scope of this blog entry, they are crucial for managing data flow and ensuring accurate processing. We will cover these methods in more depth in future posts.
The (Almost) Definitive Solution: Apache Flink
Apache Flink is an open-source stream processing framework that is gaining interest and adoption as a key component of the open-source data stack. Among the features that contribute to its widespread adoption, we highlight the following:
- Powerful Runtime: Flink provides exceptional resource efficiency, high throughput with low latency, robust state management, fault tolerance, and horizontal scalability, making it suitable for demanding applications.
- Comprehensive API and Language Support: Flink supports multiple languages and offers a rich API, making it accessible and flexible for developers.
- Unified Processing: Flink allows for both stream and batch processing using the same tools, simplifying the architecture and development process.
- Production-Readiness: Flink is designed with observability and stability in mind, ensuring it can handle production workloads reliably.
Key Use Cases and Success Stories
Data streaming is particularly essential in scenarios where real-time data processing is critical. For instance, financial services use Flink for fraud detection, real-time risk management, and algorithmic trading. Retail companies leverage Flink to personalize customer experiences and optimize inventory management. Streaming applications in manufacturing improve operational efficiency by monitoring equipment health and predicting maintenance needs.
Success stories include:
- Uber: Uses Flink to process real-time data for dynamic pricing and trip optimization.
- Netflix: Utilizes Flink for real-time recommendation systems and monitoring streaming quality.
- ING: Employs Flink for real-time fraud detection and transaction monitoring.
Apache Flink’s APIs and language support
Apache Flink offers four distinct APIs, each designed for different users and use cases, providing a versatile range of programming language support including Python, Java, and SQL. The layered APIs in Flink cater to various levels of abstraction, enabling flexibility for both common and specialized tasks. The DataStream API, available in Java and Python, allows for creating dataflow graphs through transformation functions like flatMap, filter, and process, granting fine-grained control over how records flow within the applications. This API will feel familiar to users of the Kafka Streams DSL and Kafka Processor API due to its similar functionalities.
The Table API, Flink’s next-generation declarative API, facilitates building programs using relational operations such as joins, filters, aggregations, and projections, and is optimized like Flink SQL queries. Available in Java and Python, it shares many features with SQL including type systems and built-in functions. Flink SQL, which is ANSI standard compliant, processes both real-time and historical data using Apache Calcite for query optimization. It supports complex queries and has a broad ecosystem. Additionally, Flink’s "Stateful Functions" sub-project simplifies the development of stateful, distributed event-driven applications, functioning as a stateful, fault-tolerant, distributed Actor system. This breadth of API options ensures that Apache Flink can meet diverse stream processing needs, allowing for seamless integration and evolution of services over time.
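As a small taste of the declarative side, here is a hedged sketch of running Flink SQL through the Python Table API. The table definition uses the built-in datagen connector purely for illustration; the column names and options are assumptions, not part of the example built later in this post:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Create a streaming Table API environment.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Define a source table with the built-in 'datagen' connector, which
# produces random rows and is convenient for quick experiments.
t_env.execute_sql("""
    CREATE TABLE readings (
        device_id STRING,
        temperature DOUBLE
    ) WITH (
        'connector' = 'datagen',
        'rows-per-second' = '1'
    )
""")

# Relational operations (projection + filter) expressed as plain SQL.
result = t_env.sql_query(
    "SELECT device_id, temperature FROM readings WHERE temperature > 30"
)
result.execute().print()
```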
Connectors
Apache Flink offers a variety of connectors that integrate seamlessly with different data storage systems, enhancing interoperability between services, which is crucial in an open-source data stack environment. These connectors allow Flink to ingest, process, and output data from/to various systems, ensuring smooth data flow across different platforms.
- Apache Kafka (source/sink)
- Apache Cassandra (source/sink)
- Amazon DynamoDB (sink)
- Amazon Kinesis Data Streams (source/sink)
- Amazon Kinesis Data Firehose (sink)
- DataGen (source)
- Elasticsearch (sink)
- Opensearch (sink)
- FileSystem (source/sink)
- RabbitMQ (source/sink)
- Google PubSub (source/sink)
- Hybrid Source (source)
- Apache Pulsar (source)
- Apache Iceberg (sink)
- JDBC (sink)
- MongoDB (source/sink)
These connectors significantly expand Flink’s capability to integrate with various ecosystems, allowing for versatile data processing pipelines that can handle diverse data sources and destinations. This flexibility makes Flink an indispensable tool for building robust, real-time data processing applications across different industries and use cases.
Comparison with Apache Spark Structured Streaming
While both Apache Flink and Apache Spark Structured Streaming are powerful stream processing frameworks, there are key differences:
- True Stream Processing vs. Micro-batching: Flink is designed for true stream processing, handling data on an event-by-event basis. In contrast, Spark Structured Streaming uses micro-batching, processing small batches of data at short intervals. This can lead to higher latency compared to Flink.
- State Management: Flink's advanced state management capabilities and support for exactly-once semantics make it particularly suited for complex event processing and stateful applications.
- API and Integration: Flink offers a more extensive set of APIs and integrations for complex stream processing tasks, while Spark is often favored for its unified engine and ease of transitioning from batch to streaming.
As can be seen in the micro-batch model figure from the Structured Streaming documentation, the Spark job runs at fixed intervals, in this case every second, and appends newly arrived data from the stream to an unbounded result table. If you want to dive deeper, the documentation describes additional output modes beyond append.
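For comparison, here is a minimal PySpark Structured Streaming sketch that makes the micro-batching visible. The socket source is just a convenient stand-in for a real stream, and the one-second trigger is an assumption chosen to match the example above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("micro-batch-demo").getOrCreate()

# Treat a local socket (e.g. started with `nc -lk 9999`) as an unbounded input table.
lines = (
    spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)

# On every trigger (roughly every second here), Spark processes the newly arrived
# rows as a small batch and appends them to the unbounded result table.
query = (
    lines.writeStream
    .outputMode("append")
    .format("console")
    .trigger(processingTime="1 second")
    .start()
)

query.awaitTermination()
```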
Implementation
Streaming Architecture to Implement
Alright! Enough with the theory, let’s jump into action. The following diagram shows the architecture that we’re going to build on our local machine. The main idea is to have a complete example of a real-world scenario where a stream processor can be used for something useful. In this case, we’re going to use it to filter alerts coming from IoT sensor data.
The first block constitutes the actual data publisher, the source of information. It simulates data potentially coming from different IoT devices and carries three measurements: temperature, pressure, and vibration. You can configure the rate at which the information is published to the topic by defining a SLEEP_TIME constant between messages in the producer script. Data is then stored in a Kafka topic called sensors. Finally, a Flink instance subscribes to the topic to read the data and filter the messages from the sensors that report a temperature higher than 30. You can configure this by setting the TEMP_THRESHOLD value. Messages will be displayed in the standard output (console) for now, but eventually this stream will be sunk into another storage system.
The idea is to keep things simple, experimenting with Kafka and Flink in our local environment before moving to cloud infrastructure. We think adding cloud complexity for a first interaction with these services is unnecessary. What we’re going to create is simple enough to run on a local machine, so we can build and debug with ease.
Requirements
There’s a guide in the repository README.md where all the installation steps are covered in detail. You don’t have to follow all the instructions, but we recommend installing pyenv and poetry, which are better alternatives to pip and conda without the conda packaging overhead (even using miniconda). Furthermore, poetry can be easily integrated into a CI/CD pipeline. You can skip this and simply install the requirements from requirements.txt, so it’s up to you. We have often struggled with package dependencies that poetry can resolve natively. If you haven’t tried it yet, we think this is a good opportunity. Let us know if you like it or if you struggle with installing pyenv in the comments.
Summary of Local Setup
Here’s a summary of what needs to be done locally:
- Download and install the prerequisites listed in the repository README
- Clone the repo
- Create a Python 3.10 development environment
- Install the requirements (using pip or poetry) so we can run Kafka and Flink locally.
Python Script - Producer
We’ve briefly talked about the producer; it simply simulates several sensors publishing data to the topic. When developing a streaming pipeline, one of the most important things is to design the message structure, also known as a contract, which defines how the stream processor will read the data. An IoT device can add more measurements over time, meaning more sensors and more data to exchange, so it’s common to use JSON objects as messages, which have the flexibility to change over time. Provided that you don’t alter the structure significantly, you’ll maintain access to the relevant information.
We believe it’s also important to add some parameters to the exchanged message to facilitate its traceability. At least two fields should be included to identify the uniqueness and generation time of each message. This way, it’s easier to sort them and remove duplicates, whether the processing takes place during the streaming pipeline or in a later data manipulation stage, such as in a data warehouse/lake.
Considering the above, this is the message exchanged:
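Since the exact payload lives in the repository script, here is a hedged sketch assuming the kafka-python client; the field names and value ranges are illustrative. It shows a sample message carrying the uniqueness and generation-time fields alongside the three sensor readings, plus the loop that publishes it to the sensors topic at the SLEEP_TIME rate:

```python
import json
import random
import time
import uuid
from datetime import datetime, timezone

from kafka import KafkaProducer  # kafka-python client (assumed here)

SLEEP_TIME = 1  # seconds between messages; controls the publishing rate

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    message = {
        "message_id": str(uuid.uuid4()),                      # uniqueness
        "timestamp": datetime.now(timezone.utc).isoformat(),  # generation time
        "device_id": f"sensor-{random.randint(1, 5)}",
        "temperature": round(random.uniform(20.0, 40.0), 2),
        "pressure": round(random.uniform(0.9, 1.1), 2),
        "vibration": round(random.uniform(0.0, 1.0), 2),
    }
    producer.send("sensors", value=message)
    time.sleep(SLEEP_TIME)
```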
Flink Streaming Processor - Consumer
Regarding the consumer, our stream processor is implemented using PyFlink, the Python flavor of Apache Flink. To use the framework correctly, several steps need to be followed; a general program flow involves the following:
- Obtain an execution environment,
- Load the initial data or connect to a stream source,
- Specify transformations on this data,
- Specify where to put the results of your computations,
- Trigger the program execution
First, an environment has to be initialized. In that environment, we declare the source used and import the needed connector packages. In this case, the package enables us to connect with Kafka.
Bear in mind that this particular connector depends on the Flink version used. For this reason, it’s important to refer to the official documentation and download the correct .jar package. At the time of writing this article, the connector for Apache Flink 1.19 was not yet available, forcing us to use the previous version 1.18. The README has more details regarding the link used to download it.
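A minimal sketch of this step, assuming the connector JAR has already been downloaded to a local path (the exact file name depends on the connector and Flink versions, as described in the README):

```python
from pyflink.datastream import StreamExecutionEnvironment

# Create the streaming execution environment.
env = StreamExecutionEnvironment.get_execution_environment()

# Register the Kafka connector JAR downloaded beforehand. The path and file
# name below are placeholders; match them to the version noted in the README.
env.add_jars("file:///path/to/flink-sql-connector-kafka-1.18.jar")
```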
It's also important to mention that a legacy library exists to connect with Kafka. We encourage you to use the official package available in the documentation, which uses the KafkaSource class. The documentation describes the parameters configured in the KafkaSource to build our kafka_source object correctly.
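A hedged sketch of the KafkaSource configuration follows; the broker address, consumer group, and starting offsets are illustrative local-setup choices:

```python
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream.connectors.kafka import KafkaSource, KafkaOffsetsInitializer

kafka_source = (
    KafkaSource.builder()
    .set_bootstrap_servers("localhost:9092")        # local Kafka broker
    .set_topics("sensors")                          # topic written by the producer
    .set_group_id("iot-alert-consumer")             # consumer group (illustrative)
    .set_starting_offsets(KafkaOffsetsInitializer.earliest())
    .set_value_only_deserializer(SimpleStringSchema())  # read values as plain strings
    .build()
)
```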
With the environment created and properly configured, our kafka_source can be added and a DataStream object instantiated to read the incoming stream and transform it. For our source, we’re not declaring watermark strategies; data will be processed as it arrives in the simplest way.
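Attaching the source to the environment without a watermark strategy looks roughly like this (the source name is arbitrary):

```python
from pyflink.common import WatermarkStrategy

# Create a DataStream from the Kafka source; with no watermarks, records are
# simply processed as they arrive.
ds = env.from_source(
    kafka_source,
    WatermarkStrategy.no_watermarks(),
    "Kafka Source",
)
```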
In the core of the script, transformations are done using the map function from the DataStream API. Each streaming message is parsed and manipulated, and only the ones that produce an alert string are passed downstream.
The parse_and_filter function is declared as follows. The filtering process is relatively simple, and the result is a string with a message declared as alert_message. Notice that we’re expecting a string as the function input because we’ve configured it as a SimpleStringSchema to deserialize Kafka messages.
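A hedged version of that function and how it plugs into the stream is sketched below; the alert text, the accessed field names, and the empty-string convention for non-alerts are assumptions matching the description above:

```python
import json

from pyflink.common import Types

TEMP_THRESHOLD = 30  # alert threshold for the temperature reading

def parse_and_filter(raw_message: str) -> str:
    """Parse the JSON payload and emit an alert string when the threshold is exceeded."""
    event = json.loads(raw_message)
    if event.get("temperature", 0) > TEMP_THRESHOLD:
        alert_message = (
            f"ALERT: device {event.get('device_id')} reported "
            f"temperature {event.get('temperature')}"
        )
        return alert_message
    return ""  # non-alerts become empty strings and are dropped below

alerts = (
    ds.map(parse_and_filter, output_type=Types.STRING())
    .filter(lambda value: value != "")  # keep only actual alert messages
)
alerts.print()  # write alerts to standard output for now
```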
The final step in every Flink script is to actually execute the environment. This is achieved with the execute method.
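For instance (the job name is arbitrary):

```python
# Nothing runs until the environment is executed; this call submits the dataflow.
env.execute("iot-sensor-alerts")
```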
This setup provides a simple yet effective demonstration of how to use Kafka and Flink for processing IoT sensor data on a local machine.
Next Steps
In the next part of this series, we'll take our streaming architecture to the next level by moving beyond the local environment. Here’s a sneak peek at what we’ll cover:
1. Use Containers to Run Services in a Provisioned Cluster
Running our services on a local machine is great for development and testing, but to simulate real-world conditions more closely, we'll move to a containerized environment. We’ll use Docker to containerize our Kafka and Flink services and deploy them in a provisioned cluster. This setup will mimic cloud infrastructure, allowing us to test our streaming pipeline under conditions similar to those in a production environment. We’ll explore how to set up Docker Compose files to manage multi-container Docker applications, ensuring all services are correctly networked and configured.
2. Manage Flink Jobs through the GUI Interface
Once we have our cluster up and running, managing Flink jobs efficiently becomes crucial. We'll dive into the Flink Dashboard, a powerful GUI interface that simplifies job management. Through the dashboard, we’ll show you how to submit, monitor, and manage Flink jobs. This visual tool will help us gain insights into job performance, identify bottlenecks, and troubleshoot issues more effectively.
3. Transform Data Using the SQL API
Finally, we’ll explore the power of the SQL API in Flink. This feature allows us to write SQL queries to perform complex transformations on our data streams. We’ll demonstrate how to use the SQL API to filter, aggregate, and join data streams, making our data processing pipeline more robust and versatile. This approach leverages the familiarity and expressiveness of SQL, making stream processing more accessible and powerful.
Closing Notes
As we wrap up this second part of our journey into building a streaming architecture, it's important to highlight a few key takeaways:
Simplicity First
When designing a streaming data pipeline, simplicity should always be a priority. If your transformations are straightforward, such as basic filtering or aggregation, consider whether these tasks can be performed directly on Kafka using Kafka Streams or KSQL. These tools allow for powerful stream processing capabilities directly within Kafka, reducing the need for additional components. This can simplify your architecture and reduce operational complexity.
Synergy of Components
The synergy between Kafka and Flink is a powerful aspect of our setup. Kafka serves as a robust storage and messaging system, efficiently handling high-throughput data streams. Flink complements this by providing powerful compute capabilities for complex data transformations and real-time analytics. Together, they form a resilient and scalable platform for processing IoT sensor data and many other streaming applications.
Flink Adoptability
Apache Flink's flexibility and comprehensive feature set make it an excellent choice for stream processing. Its support for various APIs, including DataStream and SQL, makes it accessible to both developers and data analysts. As you become more familiar with Flink, you'll find that it can handle a wide range of use cases, from simple event-driven applications to complex data analytics workflows.
By focusing on simplicity, leveraging the strengths of Kafka and Flink, and embracing Flink’s robust capabilities, you can build a highly effective and scalable streaming data pipeline.
We look forward to exploring these topics further and helping you take your streaming architecture to new heights. Stay tuned for the next part of our series!