Kafka/Spark Streaming System - Telecom Case Study
Dr. Abdur Rahman, ICF-PCC, SPC, AWS-SA, ACP, CSM, CPO
SVP Agile & Data Transformation & Delivery
Kafka was originally built for massive log processing. It retains messages until they expire and lets consumers pull messages at their own pace. Unlike its predecessors, Kafka is more than a message queue; it is an open-source event streaming platform suited to a wide range of use cases.
Let us review two of these use cases:
1. Log processing and analysis
2. Real-world case study in the telecom industry
Use case:
The Kafka/Spark Streaming system aims to improve customer support by giving the operator's support staff always up-to-date call quality information for all of its mobile customers.
Mobile customers, while making calls and using data, connect to the operator's infrastructure and generate logs in many different systems. Three specific logs were identified that, when correlated with each other, give visibility into the actual quality of service experienced by each individual customer. These three logs were selected because they can be correlated through a simple, relational-database-like join operation.
To improve customer support, the call quality information needs to be kept up to date in near real time; otherwise, it has no value. This requirement led, down the road, to a streaming architecture rather than a batch job. The data volume at production load reaches several GB/s, generated by several million mobile customers, 24 hours a day, 365 days a year. Performance and stability at that scale are required for the system to reach production.
Data Sources
The raw data sources are the logs of three remote systems, labeled A, B, and C here. The log from A comprises about 84-85% of the entries, the log from B about 1-2%, and the log from C about 14-15%. The fact that the data is unbalanced is one of the (many) sources of difficulty in this application.
The raw data is ingested into the system by a single Kafka producer writing into a Kafka cluster running on six servers. The producer reads the various logs and writes each log's records into its own topic; as there are three logs, there are three Kafka topics.
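As an illustration of this ingestion step, here is a minimal sketch of such a producer, assuming plain-text log files and hypothetical topic names (log-a, log-b, log-c); the actual log formats, file paths, broker addresses, and partitioning are not specified in the case study.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import scala.io.Source

object LogIngestProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // Broker list and serializers are assumptions; the real cluster runs on six servers.
    props.put("bootstrap.servers", "kafka-1:9092,kafka-2:9092,kafka-3:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    // Hypothetical mapping of each source log to its own topic.
    val logsToTopics = Seq(
      "/var/log/system-a/calls.log" -> "log-a",
      "/var/log/system-b/calls.log" -> "log-b",
      "/var/log/system-c/calls.log" -> "log-c"
    )

    for ((path, topic) <- logsToTopics) {
      // Each record of a log goes to that log's topic, one Kafka message per line.
      Source.fromFile(path).getLines().foreach { line =>
        producer.send(new ProducerRecord[String, String](topic, line))
      }
    }
    producer.flush()
    producer.close()
  }
}
```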
Spark Streaming
The data is consumed by a Spark Streaming application, which subscribes to each topic, applies a simple filter to cut out unnecessary fields, a map operation to transform the data, and finally a foreachRDD operation (each micro-batch produces an RDD in Spark Streaming) that saves the data to Ignite and, as a backup, to HDFS as Hive tables.
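A minimal sketch of that streaming job is shown below, assuming the DStream API with the Kafka 0.10 direct-stream integration. The batch interval, topic names, parsing logic, and sinks are placeholders; in particular, the Ignite write is only indicated in a comment rather than reproduced, and only a plain HDFS write is shown.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object CallQualityStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("call-quality-stream")
    val ssc  = new StreamingContext(conf, Seconds(10)) // micro-batch interval is an assumption

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka-1:9092,kafka-2:9092,kafka-3:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "call-quality"
    )

    // One direct stream subscribed to the three per-log topics.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Seq("log-a", "log-b", "log-c"), kafkaParams)
    )

    val records = stream
      .map(_.value)                                         // keep only the log line itself
      .filter(_.nonEmpty)                                   // placeholder filter for unneeded entries
      .map(line => line.split(',').take(3).mkString("\t"))  // placeholder projection of the needed fields

    records.foreachRDD { (rdd, time) =>
      // Each micro-batch arrives as an RDD. The real job writes it both to Ignite
      // (for the hourly join) and to HDFS as Hive tables for backup; here only a
      // plain HDFS write is sketched, and the Ignite write is left out.
      rdd.saveAsTextFile(s"hdfs:///backup/call-quality/${time.milliseconds}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```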
Spark
A second batch Spark application runs once per hour on the data stored in-memory in Ignite to join the records from the three separate logs into a single table. The batch job has a maximum data size of about 100GB. The cluster CPU resources should be sufficient to process this amount of data in one hour or less.
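Conceptually, the hourly job is an ordinary Spark SQL join. The sketch below assumes the three per-log datasets for the target hour are already available to Spark as tables named log_a, log_b, and log_c (for example, loaded through Ignite's Spark integration) and joins them on a hypothetical call_id key; the real key columns and output table are not given in the case study.

```scala
import org.apache.spark.sql.SparkSession

object HourlyJoinJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hourly-call-quality-join")
      .enableHiveSupport()
      .getOrCreate()

    // Assumption: the three per-log datasets for the target hour are registered as tables.
    val a = spark.table("log_a")
    val b = spark.table("log_b")
    val c = spark.table("log_c")

    // Relational-style join on a hypothetical shared key correlating the three logs.
    val joined = a
      .join(b, Seq("call_id"))
      .join(c, Seq("call_id"))

    // Persist the unified per-call quality table for the support staff to query.
    joined.write.mode("overwrite").saveAsTable("call_quality_joined")

    spark.stop()
  }
}
```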
Ignite
Ignite stores three hours' worth of data at all times to account for calls that begin in the previous hour and end in the hour being processed, as well as calls that begin in the target hour and end in the next one. The telecom operator judges that calls so long that they are not captured in this scheme can be ignored, as they are very rare.

It is worth noting that a better, all-streaming architecture could have avoided the issue of the intermediate representation in the first place. As this real-world case illustrates, spending more time and thought upfront can make the entire project finish faster than rushing headlong into coding the first working solution that comes to mind.
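As a closing illustration, one way to realize such a sliding three-hour window, assuming the records sit in an Ignite cache, is a time-based expiry policy. The cache name and key/value types below are placeholders, and the project may instead have evicted old data explicitly from the batch job.

```scala
import java.util.concurrent.TimeUnit
import javax.cache.expiry.{CreatedExpiryPolicy, Duration}
import org.apache.ignite.Ignition
import org.apache.ignite.configuration.CacheConfiguration

object CallRecordCache {
  def main(args: Array[String]): Unit = {
    val ignite = Ignition.start()

    // Hypothetical cache of raw call records keyed by record id; entries expire
    // three hours after creation, keeping roughly three hours of data resident.
    val cfg = new CacheConfiguration[String, String]("call-records")
    cfg.setExpiryPolicyFactory(CreatedExpiryPolicy.factoryOf(new Duration(TimeUnit.HOURS, 3)))

    val cache = ignite.getOrCreateCache(cfg)
    cache.put("rec-1", "example record payload")
  }
}
```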