Cheers to Real-time Analytics with Apache Flink : Part 1 of 3
A system cannot be successful if it is too strongly influenced by a single person. Once the initial design is complete and fairly robust, the real test begins as people with many different viewpoints undertake their own experiments. —Donald Knuth
This quote by Donald Knuth (cited in Designing Data-Intensive Applications, Chapter 10) used to resonate with me only when I had to prepare for a job interview.
Since I have mentioned the book, the focus of this series will remain on Chapter 11: Stream Processing.
Apache Flink ::
Apache Flink is an open-source, distributed stream processing framework for stateful computations over unbounded and bounded data streams, designed for low latency and high throughput.
Apache Kafka ::
Apache Kafka is a heavyweight in the world of big data. Here's a quick rundown of its key features:
Distributed commit log: data is written to topics that are partitioned and replicated across brokers for fault tolerance.
Durable storage: messages are persisted to disk and retained for a configurable period, so consumers can replay them.
High throughput: producers and consumers can handle very large message volumes with low overhead.
Consumer groups: multiple consumers can share the work of reading a topic, with each partition processed by exactly one member of the group.
In a nutshell, Kafka provides a powerful platform for capturing, storing, and processing real-time data streams, making it a crucial tool for modern data pipelines.
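To make this concrete, here is a minimal, illustrative sketch (not part of the project repository) of publishing a JSON event to a Kafka topic from Scala using the standard Kafka client; the topic name matches the one used later in this series, and the payload values are made up:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Illustrative payload; the real producer (main.py) generates richer transactions
    val payload = """{"transactionId":"txn-001","totalAmount":49.99,"currency":"EUR"}"""
    producer.send(new ProducerRecord[String, String]("financial_transactions", "txn-001", payload))
    producer.flush()
    producer.close()
  }
}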
Why Flink over Apache Spark?
Here's a breakdown of why Apache Flink might be advantageous over Apache Spark for stream processing scenarios:
Low Latency: Flink processes events one at a time (true streaming), so end-to-end latencies can be in the milliseconds, whereas Spark Streaming's micro-batching adds at least a batch interval of delay.
State Management: Flink offers first-class, fault-tolerant keyed state (backed by checkpoints and state backends such as RocksDB), which makes long-running stateful computations natural to express.
Exactly-Once Processing: Flink's checkpointing mechanism, combined with transactional sinks, provides end-to-end exactly-once guarantees for many connectors.
Scalability: Flink jobs scale out across a cluster, and keyed state is repartitioned when the parallelism changes.
API Focus: Flink's DataStream API is stream-first, treating batch as a special case of streaming, while Spark started from batch and added streaming on top.
Here's an example to illustrate the advantage:
Imagine you're processing real-time e-commerce transactions to detect fraudulent activity. Low latency is crucial to identify and block suspicious transactions quickly. Flink's low-latency processing and state management capabilities would be advantageous compared to Spark Streaming's micro-batching approach (a small sketch of such stateful logic follows the comparison below).
However, Spark Streaming still has its advantages:
Maturity and ecosystem: Spark has a large community and integrates tightly with Spark SQL, MLlib, and the wider batch ecosystem.
Unified engine: if most of your workload is batch with some streaming on the side, keeping everything in Spark can be operationally simpler.
Ease of adoption: teams that already run Spark can reuse existing skills, clusters, and tooling.
Ultimately, the best choice depends on your specific needs and priorities. If low latency, state management, and exactly-once processing are critical, Flink is a strong contender for your stream processing tasks.
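To give a flavour of what stateful, low-latency processing looks like in Flink, here is a minimal, illustrative sketch of a keyed function that keeps a running spend total per customer and flags suspicious activity. The class name and threshold rule are hypothetical and not part of this project; it assumes the Transaction case class defined later in this post.

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

// Hypothetical rule: flag a customer once their running spend crosses a threshold.
// The running total lives in Flink-managed keyed state, so it survives failures via checkpoints.
class SuspiciousSpendDetector(threshold: Double)
  extends KeyedProcessFunction[String, Transaction, String] {

  @transient private var totalSpend: ValueState[java.lang.Double] = _

  override def open(parameters: Configuration): Unit = {
    totalSpend = getRuntimeContext.getState(
      new ValueStateDescriptor[java.lang.Double]("total-spend", classOf[java.lang.Double]))
  }

  override def processElement(tx: Transaction,
                              ctx: KeyedProcessFunction[String, Transaction, String]#Context,
                              out: Collector[String]): Unit = {
    val updated = Option(totalSpend.value()).map(_.doubleValue()).getOrElse(0.0) + tx.totalAmount
    totalSpend.update(updated)
    if (updated > threshold)
      out.collect(s"Suspicious spend for customer ${tx.customerId}: $updated ${tx.currency}")
  }
}

// Usage (hypothetical): transactionStream.keyBy(_.customerId).process(new SuspiciousSpendDetector(10000.0)).print()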
Why have I chosen Scala?
A funny answer: Rock the JVM.
Project Architecture
Extraction From Kafka Using Flink
In Part 1 of this series I will only discuss ::
Setting up the infrastructure
This involves two steps ::
Once the repository is cloned, set up a Python virtual environment, install Docker Desktop / Docker Compose and execute ::
docker compose up -d
After that, run main.py to start producing transactions to Kafka.
From the Kafka broker container's console in Docker, you can test the consumer:
kafka-console-consumer --topic financial_transactions --bootstrap-server broker:9092
Set up the Apache Flink Kafka Source
In the Python producer above we generated payloads for e-commerce transactions; now we define a case class for the same:
case class Transaction(
  transactionId: String,
  productId: String,
  productName: String,
  productCategory: String,
  productPrice: Double,
  productQuantity: Int,
  productBrand: String,
  totalAmount: Double,
  currency: String,
  customerId: String,
  transactionDate: java.sql.Timestamp,
  paymentMethod: String
)
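For reference, a payload that maps onto this case class could look like the following (values are purely illustrative; transactionDate is shown as epoch milliseconds, which Jackson maps onto java.sql.Timestamp by default):

{
  "transactionId": "txn-0001",
  "productId": "p-42",
  "productName": "Espresso Machine",
  "productCategory": "electronics",
  "productPrice": 199.99,
  "productQuantity": 2,
  "productBrand": "acme",
  "totalAmount": 399.98,
  "currency": "EUR",
  "customerId": "c-1001",
  "transactionDate": 1718000000000,
  "paymentMethod": "credit_card"
}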
We also require a deserializer to read the messages from Kafka:
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.flink.api.common.serialization.DeserializationSchema
import org.apache.flink.api.common.typeinfo.TypeInformation

class JSONValueDeserializationSchema extends DeserializationSchema[Transaction] with Serializable {
  override def deserialize(message: Array[Byte]): Transaction = {
    val json = new String(message)
    val mapper = new ObjectMapper()
    mapper.registerModule(DefaultScalaModule) // essential for Scala case classes
    mapper.readValue(json, classOf[Transaction])
  }

  override def isEndOfStream(nextElement: Transaction): Boolean = false

  override def getProducedType: TypeInformation[Transaction] =
    TypeInformation.of(classOf[Transaction])
}
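A quick way to sanity-check the schema outside of Flink, assuming payload holds a JSON document like the one shown above as a String (a purely illustrative helper, not part of the repository):

def sanityCheck(payload: String): Unit = {
  val tx = new JSONValueDeserializationSchema().deserialize(payload.getBytes("UTF-8"))
  println(tx.productName) // prints the product name from the parsed Transaction
}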
Set up your DataStream with the StreamExecutionEnvironment
import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.connector.kafka.source.KafkaSource
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

def readCustomData(): DataStream[Transaction] = {
  val topic = "financial_transactions"
  val source = KafkaSource.builder[Transaction]()
    .setBootstrapServers("localhost:9092")
    .setTopics(topic)
    .setGroupId("ecom-group")
    .setStartingOffsets(OffsetsInitializer.earliest())
    .setValueOnlyDeserializer(new JSONValueDeserializationSchema())
    .build()

  val transactionStream: DataStream[Transaction] =
    env.fromSource(source, WatermarkStrategy.noWatermarks(), "ecom")
  transactionStream
}
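The source above deliberately uses no watermarks, since Part 1 only prints the raw stream. If event-time windowing is needed later, a bounded-out-of-orderness strategy based on transactionDate could be plugged in instead; here is a minimal sketch (the 10-second bound is an arbitrary assumption, not something prescribed by this project):

import java.time.Duration
import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}

// Event-time watermarks derived from the transaction timestamp, tolerating 10 seconds of out-of-orderness
val eventTimeStrategy: WatermarkStrategy[Transaction] =
  WatermarkStrategy
    .forBoundedOutOfOrderness[Transaction](Duration.ofSeconds(10))
    .withTimestampAssigner(new SerializableTimestampAssigner[Transaction] {
      override def extractTimestamp(tx: Transaction, recordTimestamp: Long): Long =
        tx.transactionDate.getTime
    })

// This would replace WatermarkStrategy.noWatermarks() in env.fromSource(...)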
In the main method, read from Kafka, print to the console, and trigger execution of the job:
val transactionStream = readCustomData()
transactionStream.print()
env.execute("Ecommerce Transactions Job") // required: without execute() the Flink job never runs
Here is a glimpse of the end-to-end implementation: https://dev.to/snehasish_dutta_007/apache-flink-e-commerce-data-pipeline-usecase-3ha9
Acknowledgements and Inspirations
#data #bigdata #apacheflink #scala #java #kafka #streaming #python