Production architecture of a streaming analytics data pipeline
Structured streaming orchestration framework

A structured streaming analytics workflow can consume significant resources. Efficient and reliable orchestration helps reduce cluster expense while ensuring ETL objectives are met, the machine learning model is applied and, most importantly, model outputs reach all stakeholders in a timely manner.

This article describes a scenario where streaming data is received intermittently in a storage container; the data is processed, a machine learning model is applied and finally, based on the model output, notifications are sent. The intermittent arrival pattern requires spinning up a cluster as needed and terminating it while idle (no streaming data). The data pipeline, shown above, addresses this scenario using Data Factory as the orchestration framework and Slack as the real-time notification service.

Detailed code is shared on GitHub: https://github.com/uddin007/credit-fraud-streaming-analytics

A brief summary of the production architecture and its components follows. Azure Data Factory (ADF) is used as the orchestration framework.

Azure BlobEventsTrigger

  • Create a pipeline trigger based on new file arrivals at the source container
  • The objective is to trigger the pipeline as new streaming data arrives
  • The trigger type is BlobEventsTrigger, with 'blob path ends with' set to .json (a programmatic sketch of the trigger follows this list)
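
For reference, a trigger of this shape can also be created programmatically. The sketch below uses the azure-mgmt-datafactory Python SDK; the subscription, resource group, storage account, container path and pipeline names are placeholders, not values from this project.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    BlobEventsTrigger,
    PipelineReference,
    TriggerPipelineReference,
    TriggerResource,
)

# Placeholder identifiers -- substitute your own subscription, resource group,
# data factory, storage account and pipeline names.
client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

trigger = BlobEventsTrigger(
    # Scope the trigger to the storage account that receives the streaming files
    scope=(
        "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
        "/providers/Microsoft.Storage/storageAccounts/<storage-account>"
    ),
    events=["Microsoft.Storage.BlobCreated"],        # fire on new file arrival
    blob_path_begins_with="/source-container/blobs/",
    blob_path_ends_with=".json",                     # only JSON streaming files
    pipelines=[
        TriggerPipelineReference(
            pipeline_reference=PipelineReference(reference_name="<pipeline-name>")
        )
    ],
)

client.triggers.create_or_update(
    "<resource-group>", "<data-factory>", "NewFileArrivalTrigger",
    TriggerResource(properties=trigger),
)
```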

ADF pipeline

  • The pipeline starts with a 'Get Metadata' activity that checks whether a specific file exists; this ensures a Databricks Job is not created while one is already running.
  • The second part of the pipeline is an If Condition. Based on the previous check, if a job is not already running, the pipeline moves to the next stage, i.e. triggering a Databricks job.
  • Finally, a Databricks Job is triggered on the Job Cluster configured through the Linked Service (the marker-file convention behind the existence check is sketched below).
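
A minimal sketch of the marker-file convention that the 'Get Metadata' check relies on, assuming the Databricks job writes a lock file when it starts and removes it on shutdown. The control path and file name here are hypothetical, not taken from the project.

```python
# Runs inside the Databricks job; `dbutils` is available in Databricks notebooks.
# Hypothetical control path -- ADF's Get Metadata activity checks this same blob.
LOCK_PATH = "abfss://control@<storage-account>.dfs.core.windows.net/locks/job_running"

def acquire_lock():
    """Mark the job as running so the ADF existence check finds the file."""
    dbutils.fs.put(LOCK_PATH, "running", overwrite=True)

def release_lock():
    """Remove the marker so the next trigger is allowed to start a job."""
    dbutils.fs.rm(LOCK_PATH)
```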

Databricks Job termination

  • This ensures a Job is not left running continuously when there is no new data at the source container.
  • Python UDFs are developed and applied to monitor new file arrivals at the source location.
  • If no files arrive for a predefined period, the UDFs gracefully shut down all active streams (see the sketch after this list).
  • This subsequently terminates the job cluster.
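
A minimal sketch of this idle-shutdown logic, assuming the code runs in a Databricks notebook where spark and dbutils are in scope and the runtime reports modificationTime from dbutils.fs.ls. The source path, timeout and polling interval are placeholder assumptions.

```python
import time

# Placeholder values -- adjust the landing path and idle window to your setup.
SOURCE_PATH = "abfss://source@<storage-account>.dfs.core.windows.net/landing/"
IDLE_TIMEOUT_SECONDS = 600    # stop streams after 10 minutes with no new files
POLL_INTERVAL_SECONDS = 60

def latest_arrival_seconds(path):
    """Most recent modification time (epoch seconds) among files at the path."""
    files = dbutils.fs.ls(path)
    return max((f.modificationTime for f in files), default=0) / 1000.0

while spark.streams.active:
    if time.time() - latest_arrival_seconds(SOURCE_PATH) > IDLE_TIMEOUT_SECONDS:
        # No new data for the predefined period: stop every active stream,
        # which lets the job finish and the job cluster terminate.
        for query in spark.streams.active:
            query.stop()
        break
    time.sleep(POLL_INTERVAL_SECONDS)
```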

Data storage and notification

To send Slack notifications, the following items need to be set up:

  • A Slack workspace
  • One channel for each user
  • A Slack App with Incoming Webhooks configured (a minimal posting helper is sketched below)
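
Once a webhook URL exists, posting to the channel is a single HTTP call. A minimal helper using the requests library; the message text is illustrative.

```python
import requests

def post_to_slack(webhook_url: str, text: str) -> bool:
    """Send a message to a Slack channel via its Incoming Webhook URL."""
    response = requests.post(webhook_url, json={"text": text}, timeout=10)
    return response.ok

# Example usage with a placeholder webhook URL:
# post_to_slack("https://hooks.slack.com/services/...", "Pipeline started")
```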

Create member profile

  • Create member data using the webhook URL, user ID and nameOrig
  • Join the member data with the transaction records (see the sketch below)
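
A sketch of that member profile and join, assuming the scored transactions live in a DataFrame named transactions_df keyed on nameOrig. The account IDs and webhook URLs below are placeholders, not real values.

```python
from pyspark.sql import Row

# Hypothetical member profiles: one Slack webhook per demo user, keyed on nameOrig
members_df = spark.createDataFrame([
    Row(nameOrig="C0000000001", userid="user01", webhook_url="https://hooks.slack.com/services/..."),
    Row(nameOrig="C0000000002", userid="user02", webhook_url="https://hooks.slack.com/services/..."),
    Row(nameOrig="C0000000003", userid="user03", webhook_url="https://hooks.slack.com/services/..."),
])

# Attach each transaction record to its account owner's notification channel
transactions_with_members = transactions_df.join(members_df, on="nameOrig", how="left")
```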


[Image: Real-time Slack notification]


Apply the registered Slack notification function slack_notification

  • The function has three arguments, i.e. prediction, userid and webhook_url
  • It evaluates each prediction output and sends a notification to the account owner (a sketch of such a function follows this list)
  • For the purposes of the demo, only three user accounts were used: user01, user02 and user03
  • Note: in a real production scenario, there will be far more users and, most likely, far fewer fraudulent transactions (output '1')
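
The exact implementation is in the repository; a minimal sketch of a function with this signature, registered as a PySpark UDF and applied to the joined DataFrame from the previous step, could look like the following. The message text and DataFrame names are assumptions.

```python
import requests
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

def slack_notification(prediction, userid, webhook_url):
    """Notify the account owner via Slack when a transaction is flagged as fraud."""
    if prediction == 1 and webhook_url:
        message = f"Alert for {userid}: a potentially fraudulent transaction was detected."
        response = requests.post(webhook_url, json={"text": message}, timeout=10)
        return "sent" if response.ok else "failed"
    return "not required"

slack_notification_udf = udf(slack_notification, StringType())

# Evaluate every scored record and record whether a notification went out
final_df = transactions_with_members.withColumn(
    "notification_status",
    slack_notification_udf(col("prediction"), col("userid"), col("webhook_url")),
)
```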

Finally, the notification status is marked on the final table, as follows:

[Image: Model output and notification status]



