Cheers to Real-time Analytics with Apache Flink : Part 1 of 3


A system cannot be successful if it is too strongly influenced by a single person. Once the initial design is complete and fairly robust, the real test begins as people with many different viewpoints undertake their own experiments. —Donald Knuth

This quote by Donald Knuth (quoted in Designing Data-Intensive Applications, Chapter 10) has only ever resonated with me when I was preparing for a job interview.

Since I have mentioned the book, the focus of this article will remain on Chapter 11: Stream Processing.

Apache Flink ::

  • Born out of research at TU Berlin, Apache Flink became a top-level project of the Apache Software Foundation in 2014 and has since grown into one of its leading projects.
  • The core of Apache Flink is a distributed streaming dataflow engine written in Java and Scala.
  • Applications can be developed in Java, Scala, any other JVM-based language, or Python (PyFlink). Flink SQL has also been adopted by Confluent, where it is positioned as the successor to ksqlDB (KSQL).

Apache Kafka ::

Apache Kafka is a heavyweight in the world of big data. Here's a quick rundown of its key features:

  • Real-time Streaming: Kafka excels at handling high-volume streams of data, making it ideal for applications that need to react to events as they happen.
  • Distributed and Scalable: Built for large-scale deployments, Kafka can handle massive data loads by efficiently distributing data across multiple servers.
  • Fault Tolerant: Kafka ensures data durability and availability even in case of server failures.
  • Messaging System: At its core, Kafka acts as a robust messaging system, allowing applications to publish and subscribe to data streams.

In a nutshell, Kafka provides a powerful platform for capturing, storing, and processing real-time data streams, making it a crucial tool for modern data pipelines.


Why Flink over Apache Spark?

Here's a breakdown of why Apache Flink might be advantageous over Apache Spark for stream processing scenarios:

Low Latency:

  • Flink: Designed for real-time stream processing with low latency (milliseconds).
  • Spark: While Spark Streaming offers micro-batching for near real-time processing, it can still introduce some latency due to batch processing intervals.

State Management:

  • Flink: Offers built-in state management for keeping track of information about the data stream, enabling complex operations like windowing and aggregations (see the sketch after this list).
  • Spark: Requires additional libraries or workarounds for efficient state management in streaming applications.
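
To make that concrete, here is a minimal, illustrative sketch (not taken from this project's code) of a keyed, windowed aggregation; Flink manages the per-key running sums in its state backend, and the (category, amount) tuple stands in for the transaction record defined later in this article:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

// Revenue per product category over 1-minute processing-time windows
def revenuePerCategory(transactions: DataStream[(String, Double)]): DataStream[(String, Double)] =
  transactions
    .keyBy(_._1)                                               // key by category; Flink keeps state per key
    .window(TumblingProcessingTimeWindows.of(Time.minutes(1))) // 1-minute tumbling windows
    .sum(1)                                                    // running sum of the amount field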

Exactly-Once Processing:

  • Flink: Can guarantee exactly-once processing semantics, ensuring that each event is processed only once, even in case of failures (a checkpointing sketch follows this list).
  • Spark: Offers at-least-once or at-most-once semantics by default, potentially leading to duplicate processing or data loss in some scenarios.
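
In Flink, that guarantee hinges on checkpointing, which has to be enabled explicitly. A minimal, illustrative sketch (the interval is arbitrary); note that end-to-end exactly-once additionally requires a replayable source such as Kafka and a transactional or idempotent sink:

import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Snapshot all operator state every 10 seconds with exactly-once consistency
env.enableCheckpointing(10000L, CheckpointingMode.EXACTLY_ONCE)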

Scalability:

  • Flink: Highly scalable due to its distributed architecture and parallel processing capabilities.
  • Spark: While Spark scales well, Flink might be more efficient for very high-volume data streams due to its focus on stream processing.

API Focus:

  • Flink: Offers a unified API for both batch and stream processing, making it easier to develop applications that work with both data types (illustrated after this list).
  • Spark: Requires separate APIs for batch processing (Spark SQL) and stream processing (Spark Streaming), potentially increasing development complexity.
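
"Unified" here means that, since Flink 1.12, the very same DataStream program can run either as a streaming job or as a batch job over bounded input, just by switching the runtime mode. A minimal, illustrative sketch:

import org.apache.flink.api.common.RuntimeExecutionMode
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Execute the same pipeline definition as a bounded batch job ...
env.setRuntimeMode(RuntimeExecutionMode.BATCH)
// ... or leave the default STREAMING mode for unbounded sources such as Kafka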

Here's an example to illustrate the advantage:

Imagine you're processing real-time e-commerce transactions to detect fraudulent activity. Low latency is crucial to identify and block suspicious transactions quickly. Flink's low-latency processing and state management capabilities would be advantageous compared to Spark Streaming's micro-batching approach.
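
To ground that, here is a heavily simplified, hypothetical fraud rule in Flink: per-customer spend is tracked in keyed ValueState, and an alert is emitted as soon as a transaction pushes the running total over a threshold. The class name, threshold and field layout are illustrative and not part of this project:

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

// Input is a (customerId, amount) pair standing in for a full transaction record
class SuspiciousSpendDetector(threshold: Double)
  extends KeyedProcessFunction[String, (String, Double), String] {

  // Running spend, kept separately for every customer key by Flink's state backend
  private var spentSoFar: ValueState[Double] = _

  override def open(parameters: Configuration): Unit =
    spentSoFar = getRuntimeContext.getState(
      new ValueStateDescriptor[Double]("spent-so-far", classOf[Double]))

  override def processElement(
      txn: (String, Double),
      ctx: KeyedProcessFunction[String, (String, Double), String]#Context,
      out: Collector[String]): Unit = {
    val total = spentSoFar.value() + txn._2   // empty state reads back as 0.0 for a new customer
    spentSoFar.update(total)
    if (total > threshold)
      out.collect(s"Customer ${txn._1} crossed $threshold, spend so far: $total")
  }
}

// Usage, given a DataStream[(String, Double)] named transactions:
// transactions.keyBy(_._1).process(new SuspiciousSpendDetector(10000.0))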

However, Spark Streaming still has its advantages:

  • Wider Ecosystem: Spark has a broader ecosystem of tools and libraries that might be useful for certain tasks.
  • Learning Curve: If you're already familiar with Spark for batch processing, Spark Streaming might have a smoother learning curve.

Ultimately, the best choice depends on your specific needs and priorities. If low latency, state management, and exactly-once processing are critical, Flink is a strong contender for your stream processing tasks.


Why have I chosen Scala?

  • Flink libraries are readily available in Java.
  • Coming from a Spark/Scala case-class background, the switch was simply from Dataset (Spark) to DataStream (Flink).
  • Stay tuned for the PyFlink experiment as well.

A funny answer is Rock the JVM.


Project Architecture

  • Source :: Apache Kafka
  • ETL Framework :: Apache Flink
  • Sink :: PostgreSQL and Elasticsearch

Elasticsearch and Kibana are used for real-time dashboarding.

Extraction From Kafka Using Flink

In Part 1 of this series I will only cover ::

  • Setting up the infrastructure
  • Reading From Kafka


Setting up the infrastructure

Infrastructure Repository

This involves two steps ::

Once the repository is cloned, set up a Python virtual environment, install Docker Desktop / Docker Compose and execute ::

docker compose up -d        

After that, run main.py

Python Execution Output

From the Kafka broker's Docker console you can test the consumer ::

kafka-console-consumer --topic financial_transactions --bootstrap-server broker:9092        
Kafka Consumer Output

Set up the Apache Flink Kafka Source

In the Python function above we generated payloads for e-commerce transactions; now we define a case class for the same.

case class Transaction(
                        transactionId: String,
                        productId: String,
                        productName: String,
                        productCategory: String,
                        productPrice: Double,
                        productQuantity: Int,
                        productBrand: String,
                        totalAmount: Double,
                        currency: String,
                        customerId: String,
                        transactionDate: java.sql.Timestamp,
                        paymentMethod: String
                      )        

We also require a deserializer to read the messages from Kafka.

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.flink.api.common.serialization.DeserializationSchema
import org.apache.flink.api.common.typeinfo.TypeInformation

class JSONValueDeserializationSchema extends DeserializationSchema[Transaction] with Serializable {

  // Called for every Kafka record: turn the raw UTF-8 JSON bytes into a Transaction
  override def deserialize(message: Array[Byte]): Transaction = {
    val json = new String(message)
    val mapper = new ObjectMapper()
    mapper.registerModule(DefaultScalaModule) // essential for deserialising into Scala case classes
    mapper.readValue(json, classOf[Transaction])
  }

  // The transaction stream is unbounded, so there is no end-of-stream marker
  override def isEndOfStream(nextElement: Transaction): Boolean = false

  // Tells Flink which type this schema produces
  override def getProducedType: TypeInformation[Transaction] = TypeInformation.of(classOf[Transaction])
}
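
As a quick sanity check outside Flink, the schema can be fed a hand-written JSON payload; every value below is made up and only mirrors the field names produced by the Python generator, with the timestamp passed as epoch milliseconds (which Jackson's default date handling accepts for java.sql.Timestamp):

val sample =
  """{
    |  "transactionId": "txn-0001",
    |  "productId": "prod-42",
    |  "productName": "mechanical keyboard",
    |  "productCategory": "electronics",
    |  "productPrice": 49.99,
    |  "productQuantity": 2,
    |  "productBrand": "acme",
    |  "totalAmount": 99.98,
    |  "currency": "USD",
    |  "customerId": "cust-007",
    |  "transactionDate": 1714000000000,
    |  "paymentMethod": "credit_card"
    |}""".stripMargin

val txn = new JSONValueDeserializationSchema().deserialize(sample.getBytes("UTF-8"))
println(txn.productCategory) // electronics
println(txn.totalAmount)     // 99.98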

Set up your DataStream with StreamExecutionEnvironment

import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.connector.kafka.source.KafkaSource
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

def readCustomData(): DataStream[Transaction] = {
    val topic = "financial_transactions"
    // Kafka source: bootstrap servers, topic, consumer group and offset strategy
    val source = KafkaSource.builder[Transaction]()
      .setBootstrapServers("localhost:9092")
      .setTopics(topic)
      .setGroupId("ecom-group")
      .setStartingOffsets(OffsetsInitializer.earliest())          // replay the topic from the beginning
      .setValueOnlyDeserializer(new JSONValueDeserializationSchema())
      .build()

    // No event-time watermarks are needed yet; Part 1 only reads and prints the stream
    val transactionStream: DataStream[Transaction] = env.fromSource(source, watermarkStrategy = WatermarkStrategy.noWatermarks(), "ecom")

    transactionStream
  }

In the main method, read from Kafka and print to the console.

val transactionStream = readCustomData()
transactionStream.print()

// Nothing runs until the job is actually submitted for execution
env.execute()
IDE Output

  • In Part 2 of this series, I will discuss on-the-fly transformations and aggregations, and writing to a sink (PostgreSQL).
  • In Part 3 of this series, I will cover writing to Elasticsearch and creating real-time dashboards in Kibana.


Here is a glimpse of the end-to-end implementation: https://dev.to/snehasish_dutta_007/apache-flink-e-commerce-data-pipeline-usecase-3ha9


Acknowledgements and Inspirations


#data #bigdata #apacheflink #scala #java #kafka #streaming #python

Pablo Cocko

Senior Data Engineer at SCRM

9 months ago

Good lecture SNEHASISH DUTTA. Looking forward to reading the second part of this series. We didn't have too much time for discussing this case in Berlin.
