Cheers to Real-time Analytics with Apache Flink : Part 1 of 3


A system cannot be successful if it is too strongly influenced by a single person. Once the initial design is complete and fairly robust, the real test begins as people with many different viewpoints undertake their own experiments. —Donald Knuth

This quote by Donald Knuth (quoted in Designing Data-Intensive Applications, Chapter 10) has only ever resonated with me when I was preparing for a job interview.

Since I have mentioned the book, the focus of this article will remain on Chapter 11: Stream Processing.

Apache Flink ::

  • Born out of research at TU Berlin, Apache Flink became a top-level project of the Apache Software Foundation in 2014 and has since grown into one of its leading projects.
  • The core of Apache Flink is a distributed streaming dataflow engine written in Java and Scala.
  • Applications can be developed in Java, Scala, any other JVM-based language, or Python (PyFlink). Flink SQL has also been adopted by Confluent, where it is positioned as the successor to ksqlDB (KSQL).

Apache Kafka ::

Apache Kafka is a heavyweight in the world of big data. Here's a quick rundown of its key features:

  • Real-time Streaming: Kafka excels at handling high-volume streams of data, making it ideal for applications that need to react to events as they happen.
  • Distributed and Scalable: Built for large-scale deployments, Kafka can handle massive data loads by efficiently distributing data across multiple servers.
  • Fault Tolerant: Kafka ensures data durability and availability even in case of server failures.
  • Messaging System: At its core, Kafka acts as a robust messaging system, allowing applications to publish and subscribe to data streams.

In a nutshell, Kafka provides a powerful platform for capturing, storing, and processing real-time data streams, making it a crucial tool for modern data pipelines.


Why Flink over Apache Spark?

Here's a breakdown of why Apache Flink might be advantageous over Apache Spark for stream processing scenarios:

Low Latency:

  • Flink: Designed for real-time stream processing with low latency (milliseconds).
  • Spark: While Spark Streaming offers micro-batching for near real-time processing, it can still introduce some latency due to batch processing intervals.

State Management:

  • Flink: Offers built-in state management for keeping track of information about the data stream, enabling complex operations like windowing and aggregations (see the sketch after this list).
  • Spark: Requires additional libraries or workarounds for efficient state management in streaming applications.
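
To make that concrete, here is a minimal, illustrative sketch (not taken from this project's code) of a keyed, windowed aggregation; Flink manages the per-key running sums in its state backend, and the (category, amount) tuple stands in for the transaction record defined later in this article:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

// Revenue per product category over 1-minute processing-time windows
def revenuePerCategory(transactions: DataStream[(String, Double)]): DataStream[(String, Double)] =
  transactions
    .keyBy(_._1)                                               // key by category; Flink keeps state per key
    .window(TumblingProcessingTimeWindows.of(Time.minutes(1))) // 1-minute tumbling windows
    .sum(1)                                                    // running sum of the amount field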

Exactly-Once Processing:

  • Flink: Can guarantee exactly-once processing semantics, ensuring that each event is processed only once, even in case of failures (a checkpointing sketch follows this list).
  • Spark: Offers at-least-once or at-most-once semantics by default, potentially leading to duplicate processing or data loss in some scenarios.
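
In Flink, that guarantee hinges on checkpointing, which has to be enabled explicitly. A minimal, illustrative sketch (the interval is arbitrary); note that end-to-end exactly-once additionally requires a replayable source such as Kafka and a transactional or idempotent sink:

import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Snapshot all operator state every 10 seconds with exactly-once consistency
env.enableCheckpointing(10000L, CheckpointingMode.EXACTLY_ONCE)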

Scalability:

  • Flink: Highly scalable due to its distributed architecture and parallel processing capabilities.
  • Spark: While Spark scales well, Flink might be more efficient for very high-volume data streams due to its focus on stream processing.

API Focus:

  • Flink: Offers a unified API for both batch and stream processing, making it easier to develop applications that work with both data types (illustrated after this list).
  • Spark: Requires separate APIs for batch processing (Spark SQL) and stream processing (Spark Streaming), potentially increasing development complexity.
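
"Unified" here means that, since Flink 1.12, the very same DataStream program can run either as a streaming job or as a batch job over bounded input, just by switching the runtime mode. A minimal, illustrative sketch:

import org.apache.flink.api.common.RuntimeExecutionMode
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Execute the same pipeline definition as a bounded batch job ...
env.setRuntimeMode(RuntimeExecutionMode.BATCH)
// ... or leave the default STREAMING mode for unbounded sources such as Kafka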

Here's an example to illustrate the advantage:

Imagine you're processing real-time e-commerce transactions to detect fraudulent activity. Low latency is crucial to identify and block suspicious transactions quickly. Flink's low-latency processing and state management capabilities would be advantageous compared to Spark Streaming's micro-batching approach.
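
To ground that, here is a heavily simplified, hypothetical fraud rule in Flink: per-customer spend is tracked in keyed ValueState, and an alert is emitted as soon as a transaction pushes the running total over a threshold. The class name, threshold and field layout are illustrative and not part of this project:

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

// Input is a (customerId, amount) pair standing in for a full transaction record
class SuspiciousSpendDetector(threshold: Double)
  extends KeyedProcessFunction[String, (String, Double), String] {

  // Running spend, kept separately for every customer key by Flink's state backend
  private var spentSoFar: ValueState[Double] = _

  override def open(parameters: Configuration): Unit =
    spentSoFar = getRuntimeContext.getState(
      new ValueStateDescriptor[Double]("spent-so-far", classOf[Double]))

  override def processElement(
      txn: (String, Double),
      ctx: KeyedProcessFunction[String, (String, Double), String]#Context,
      out: Collector[String]): Unit = {
    val total = spentSoFar.value() + txn._2   // empty state reads back as 0.0 for a new customer
    spentSoFar.update(total)
    if (total > threshold)
      out.collect(s"Customer ${txn._1} crossed $threshold, spend so far: $total")
  }
}

// Usage, given a DataStream[(String, Double)] named transactions:
// transactions.keyBy(_._1).process(new SuspiciousSpendDetector(10000.0))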

However, Spark Streaming still has its advantages:

  • Wider Ecosystem: Spark has a broader ecosystem of tools and libraries that might be useful for certain tasks.
  • Learning Curve: If you're already familiar with Spark for batch processing, Spark Streaming might have a smoother learning curve.

Ultimately, the best choice depends on your specific needs and priorities. If low latency, state management, and exactly-once processing are critical, Flink is a strong contender for your stream processing tasks.


Why have I chosen Scala?

  • Flink libraries are readily available in Java.
  • Coming from a Spark/Scala case-class background, the switch was simply from Dataset (Spark) to DataStream (Flink).
  • Stay tuned for the PyFlink experiment as well.

A funny answer is Rock the JVM.


Project Architecture

  • Source :: Apache Kafka
  • ETL Framework :: Apache Flink
  • Sink :: PostgreSQL and Elasticsearch

Elasticsearch and Kibana are used for real-time dashboarding.

Extraction From Kafka Using Flink

In Part 1 of this series I will only cover ::

  • Setting up the infrastructure
  • Reading From Kafka


Setting up the infrastructure

Infrastructure Repository

This involves two steps ::

Once the repository is cloned, set up a Python virtual environment, install Docker Desktop / Docker Compose and execute ::

docker compose up -d        

After that, run main.py

Python Execution Output

From the Kafka broker's Docker console you can test the consumer ::

kafka-console-consumer --topic financial_transactions --bootstrap-server broker:9092        
Kafka Consumer Output

Set up the Apache Flink Kafka Source

In the Python function above we generated payloads for e-commerce transactions; now we define a case class for the same.

case class Transaction(
                        transactionId: String,
                        productId: String,
                        productName: String,
                        productCategory: String,
                        productPrice: Double,
                        productQuantity: Int,
                        productBrand: String,
                        totalAmount: Double,
                        currency: String,
                        customerId: String,
                        transactionDate: java.sql.Timestamp,
                        paymentMethod: String
                      )        

We also require a deserializer to read the messages from Kafka.

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.flink.api.common.serialization.DeserializationSchema
import org.apache.flink.api.common.typeinfo.TypeInformation

class JSONValueDeserializationSchema extends DeserializationSchema[Transaction] with Serializable {

  // Called for every Kafka record: turn the raw UTF-8 JSON bytes into a Transaction
  override def deserialize(message: Array[Byte]): Transaction = {
    val json = new String(message)
    val mapper = new ObjectMapper()
    mapper.registerModule(DefaultScalaModule) // essential for deserialising into Scala case classes
    mapper.readValue(json, classOf[Transaction])
  }

  // The transaction stream is unbounded, so there is no end-of-stream marker
  override def isEndOfStream(nextElement: Transaction): Boolean = false

  // Tells Flink which type this schema produces
  override def getProducedType: TypeInformation[Transaction] = TypeInformation.of(classOf[Transaction])
}
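
As a quick sanity check outside Flink, the schema can be fed a hand-written JSON payload; every value below is made up and only mirrors the field names produced by the Python generator, with the timestamp passed as epoch milliseconds (which Jackson's default date handling accepts for java.sql.Timestamp):

val sample =
  """{
    |  "transactionId": "txn-0001",
    |  "productId": "prod-42",
    |  "productName": "mechanical keyboard",
    |  "productCategory": "electronics",
    |  "productPrice": 49.99,
    |  "productQuantity": 2,
    |  "productBrand": "acme",
    |  "totalAmount": 99.98,
    |  "currency": "USD",
    |  "customerId": "cust-007",
    |  "transactionDate": 1714000000000,
    |  "paymentMethod": "credit_card"
    |}""".stripMargin

val txn = new JSONValueDeserializationSchema().deserialize(sample.getBytes("UTF-8"))
println(txn.productCategory) // electronics
println(txn.totalAmount)     // 99.98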

Set up your DataStream with StreamExecutionEnvironment

import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.connector.kafka.source.KafkaSource
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

def readCustomData(): DataStream[Transaction] = {
    val topic = "financial_transactions"
    // Kafka source: bootstrap servers, topic, consumer group and offset strategy
    val source = KafkaSource.builder[Transaction]()
      .setBootstrapServers("localhost:9092")
      .setTopics(topic)
      .setGroupId("ecom-group")
      .setStartingOffsets(OffsetsInitializer.earliest())          // replay the topic from the beginning
      .setValueOnlyDeserializer(new JSONValueDeserializationSchema())
      .build()

    // No event-time watermarks are needed yet; Part 1 only reads and prints the stream
    val transactionStream: DataStream[Transaction] = env.fromSource(source, watermarkStrategy = WatermarkStrategy.noWatermarks(), "ecom")

    transactionStream
  }

In the main method, read from Kafka and print to the console.

val transactionStream = readCustomData()
transactionStream.print()

// Nothing runs until the job is actually submitted for execution
env.execute()
IDE Output

  • In Part 2 of this series, I will discuss on-the-fly transformations and aggregations, and writing to a sink (PostgreSQL).
  • In Part 3 of this series, I will cover writing to Elasticsearch and creating real-time dashboards in Kibana.


Here is a glimpse of the end-to-end implementation: https://dev.to/snehasish_dutta_007/apache-flink-e-commerce-data-pipeline-usecase-3ha9


Acknowledgements and Inspirations


#data #bigdata #apacheflink #scala #java #kafka #streaming #python

Pablo Cocko

Senior Data Engineer at SCRM

9 months ago

Good lecture SNEHASISH DUTTA. Looking forward to reading the second part of this series. We didn't have too much time for discussing this case in Berlin.
