Kafka/Spark Streaming System - Telecom Case Study
Dr. Abdur Rahman, ICF-PCC, SPC, AWS-SA, ACP, CSM, CPO
SVP Agile & Data Transformation & Delivery
Kafka was originally built for massive log processing. It retains messages until they expire and lets consumers pull messages at their own pace. Unlike its predecessors, Kafka is more than a message queue; it is an open-source event streaming platform suited to a wide range of use cases.
Let us review two of these use cases:
1. Log processing and analysis
2. Real-world case study in the telecom industry
Use case:
The Kafka/Spark Streaming system aims to improve customer support by giving the operator's support staff always up-to-date call quality information for all of its mobile customers.
Mobile customers, while making calls and using data, connect to the operator's infrastructure and generate logs in many different systems. Three specific logs were identified that, when correlated with each other, give visibility into the actual quality of service experienced by each individual customer. These three logs were selected because they can be correlated through a simple, relational-database-like join operation.
To improve customer support, the call quality information needs to be kept up to date in near real time; otherwise, it has no value. This requirement led, down the road, to a streaming architecture rather than a batch job. The data volume at production load reaches several GB/s, generated by several million mobile customers, 24 hours a day, 365 days a year. Performance and stability at that scale are required for the system to reach production.
Data Sources
The raw data sources are the logs of three remote systems, labeled A, B, and C here. The log from A comprises about 84-85% of the entries, the log from B about 1-2%, and the log from C about 14-15%. The fact that the data is unbalanced is one of the (many) sources of difficulty in this application.
The raw data is ingested into the system by a single Kafka producer writing into a Kafka cluster running on six servers. The producer reads the various logs and writes each log's records into its own topic; as there are three logs, there are three Kafka topics.
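As an illustration of this ingestion step, here is a minimal sketch of such a producer, assuming plain-text log files and hypothetical topic names (log-a, log-b, log-c); the actual log formats, file paths, broker addresses, and partitioning are not specified in the case study.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import scala.io.Source

object LogIngestProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // Broker list and serializers are assumptions; the real cluster runs on six servers.
    props.put("bootstrap.servers", "kafka-1:9092,kafka-2:9092,kafka-3:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    // Hypothetical mapping of each source log to its own topic.
    val logsToTopics = Seq(
      "/var/log/system-a/calls.log" -> "log-a",
      "/var/log/system-b/calls.log" -> "log-b",
      "/var/log/system-c/calls.log" -> "log-c"
    )

    for ((path, topic) <- logsToTopics) {
      // Each record of a log goes to that log's topic, one Kafka message per line.
      Source.fromFile(path).getLines().foreach { line =>
        producer.send(new ProducerRecord[String, String](topic, line))
      }
    }
    producer.flush()
    producer.close()
  }
}
```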
Spark Streaming
The data is consumed by a Spark Streaming application, which subscribes to each topic, applies a simple filter to cut out unnecessary fields, a map operation to transform the data, and finally a foreachRDD operation (each micro-batch produces an RDD in Spark Streaming) that saves the data to Ignite and, as a backup, to HDFS as Hive tables.
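A minimal sketch of that streaming job is shown below, assuming the DStream API with the Kafka 0.10 direct-stream integration. The batch interval, topic names, parsing logic, and sinks are placeholders; in particular, the Ignite write is only indicated in a comment rather than reproduced, and only a plain HDFS write is shown.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object CallQualityStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("call-quality-stream")
    val ssc  = new StreamingContext(conf, Seconds(10)) // micro-batch interval is an assumption

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka-1:9092,kafka-2:9092,kafka-3:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "call-quality"
    )

    // One direct stream subscribed to the three per-log topics.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Seq("log-a", "log-b", "log-c"), kafkaParams)
    )

    val records = stream
      .map(_.value)                                         // keep only the log line itself
      .filter(_.nonEmpty)                                   // placeholder filter for unneeded entries
      .map(line => line.split(',').take(3).mkString("\t"))  // placeholder projection of the needed fields

    records.foreachRDD { (rdd, time) =>
      // Each micro-batch arrives as an RDD. The real job writes it both to Ignite
      // (for the hourly join) and to HDFS as Hive tables for backup; here only a
      // plain HDFS write is sketched, and the Ignite write is left out.
      rdd.saveAsTextFile(s"hdfs:///backup/call-quality/${time.milliseconds}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```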
Spark
A second batch Spark application runs once per hour on the data stored in-memory in Ignite to join the records from the three separate logs into a single table. The batch job has a maximum data size of about 100GB. The cluster CPU resources should be sufficient to process this amount of data in one hour or less.
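Conceptually, the hourly job is an ordinary Spark SQL join. The sketch below assumes the three per-log datasets for the target hour are already available to Spark as tables named log_a, log_b, and log_c (for example, loaded through Ignite's Spark integration) and joins them on a hypothetical call_id key; the real key columns and output table are not given in the case study.

```scala
import org.apache.spark.sql.SparkSession

object HourlyJoinJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hourly-call-quality-join")
      .enableHiveSupport()
      .getOrCreate()

    // Assumption: the three per-log datasets for the target hour are registered as tables.
    val a = spark.table("log_a")
    val b = spark.table("log_b")
    val c = spark.table("log_c")

    // Relational-style join on a hypothetical shared key correlating the three logs.
    val joined = a
      .join(b, Seq("call_id"))
      .join(c, Seq("call_id"))

    // Persist the unified per-call quality table for the support staff to query.
    joined.write.mode("overwrite").saveAsTable("call_quality_joined")

    spark.stop()
  }
}
```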
Ignite
Ignite stores three hours' worth of data at all times to account for calls that begin in the previous hour and end in the hour being processed, as well as calls that begin in the target hour and end in the next one. The telecom operator judges that calls so long that they are not captured in this scheme can be ignored, as they are very rare.

It is worth noting that a better, all-streaming architecture could have avoided the issue of the intermediate representation in the first place. As this real-world case illustrates, spending more time and thought upfront can make the entire project finish faster than rushing headlong into coding the first working solution that comes to mind.
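As a closing illustration, one way to realize such a sliding three-hour window, assuming the records sit in an Ignite cache, is a time-based expiry policy. The cache name and key/value types below are placeholders, and the project may instead have evicted old data explicitly from the batch job.

```scala
import java.util.concurrent.TimeUnit
import javax.cache.expiry.{CreatedExpiryPolicy, Duration}
import org.apache.ignite.Ignition
import org.apache.ignite.configuration.CacheConfiguration

object CallRecordCache {
  def main(args: Array[String]): Unit = {
    val ignite = Ignition.start()

    // Hypothetical cache of raw call records keyed by record id; entries expire
    // three hours after creation, keeping roughly three hours of data resident.
    val cfg = new CacheConfiguration[String, String]("call-records")
    cfg.setExpiryPolicyFactory(CreatedExpiryPolicy.factoryOf(new Duration(TimeUnit.HOURS, 3)))

    val cache = ignite.getOrCreateCache(cfg)
    cache.put("rec-1", "example record payload")
  }
}
```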