Production architecture of a streaming analytics data pipeline
Structured streaming orchestration framework

A structured streaming analytics workflow can consume significant resources. Efficient and reliable orchestration helps reduce cluster expense while ensuring ETL objectives are met, the machine learning model is applied and, most importantly, model outputs reach all stakeholders in a timely manner.

This article describes a scenario where streaming data is received intermittently in a storage container; the data is processed, a machine learning model is applied and finally, based on the model output, notifications are sent. The intermittent arrival pattern requires spinning up a cluster as needed and terminating it while idle (no streaming data). The data pipeline, shown above, addresses this scenario using Data Factory as the orchestration framework and Slack as the real-time notification service.

Detailed code is shared on GitHub: https://github.com/uddin007/credit-fraud-streaming-analytics

A brief summary of the production architecture and its components follows. Azure Data Factory (ADF) is used as the orchestration framework.

Azure BlobEventsTrigger

  • Create a pipeline trigger based on new file arrivals at the source container
  • The objective is to trigger the pipeline as new streaming data arrives
  • The trigger type is BlobEventsTrigger, with 'blob path ends with' set to .json (a programmatic sketch of the trigger follows this list)
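
For reference, a trigger of this shape can also be created programmatically. The sketch below uses the azure-mgmt-datafactory Python SDK; the subscription, resource group, storage account, container path and pipeline names are placeholders, not values from this project.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    BlobEventsTrigger,
    PipelineReference,
    TriggerPipelineReference,
    TriggerResource,
)

# Placeholder identifiers -- substitute your own subscription, resource group,
# data factory, storage account and pipeline names.
client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

trigger = BlobEventsTrigger(
    # Scope the trigger to the storage account that receives the streaming files
    scope=(
        "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
        "/providers/Microsoft.Storage/storageAccounts/<storage-account>"
    ),
    events=["Microsoft.Storage.BlobCreated"],        # fire on new file arrival
    blob_path_begins_with="/source-container/blobs/",
    blob_path_ends_with=".json",                     # only JSON streaming files
    pipelines=[
        TriggerPipelineReference(
            pipeline_reference=PipelineReference(reference_name="<pipeline-name>")
        )
    ],
)

client.triggers.create_or_update(
    "<resource-group>", "<data-factory>", "NewFileArrivalTrigger",
    TriggerResource(properties=trigger),
)
```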

ADF pipeline

  • The pipeline starts with a 'Get Metadata' activity that checks whether a specific file exists; this ensures a Databricks Job is not created while one is already running.
  • The second part of the pipeline is an If Condition. Based on the previous check, if a job is not already running, the pipeline moves to the next stage, i.e. triggering a Databricks job.
  • Finally, a Databricks Job is triggered on the Job Cluster configured through the Linked Service (the marker-file convention behind the existence check is sketched below).
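
A minimal sketch of the marker-file convention that the 'Get Metadata' check relies on, assuming the Databricks job writes a lock file when it starts and removes it on shutdown. The control path and file name here are hypothetical, not taken from the project.

```python
# Runs inside the Databricks job; `dbutils` is available in Databricks notebooks.
# Hypothetical control path -- ADF's Get Metadata activity checks this same blob.
LOCK_PATH = "abfss://control@<storage-account>.dfs.core.windows.net/locks/job_running"

def acquire_lock():
    """Mark the job as running so the ADF existence check finds the file."""
    dbutils.fs.put(LOCK_PATH, "running", overwrite=True)

def release_lock():
    """Remove the marker so the next trigger is allowed to start a job."""
    dbutils.fs.rm(LOCK_PATH)
```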

Databricks Job termination

  • This ensures a Job is not left running continuously when there is no new data at the source container.
  • Python UDFs are developed and applied to monitor new file arrivals at the source location.
  • If no files arrive for a predefined period, the UDFs gracefully shut down all active streams (see the sketch after this list).
  • This subsequently terminates the job cluster.
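
A minimal sketch of this idle-shutdown logic, assuming the code runs in a Databricks notebook where spark and dbutils are in scope and the runtime reports modificationTime from dbutils.fs.ls. The source path, timeout and polling interval are placeholder assumptions.

```python
import time

# Placeholder values -- adjust the landing path and idle window to your setup.
SOURCE_PATH = "abfss://source@<storage-account>.dfs.core.windows.net/landing/"
IDLE_TIMEOUT_SECONDS = 600    # stop streams after 10 minutes with no new files
POLL_INTERVAL_SECONDS = 60

def latest_arrival_seconds(path):
    """Most recent modification time (epoch seconds) among files at the path."""
    files = dbutils.fs.ls(path)
    return max((f.modificationTime for f in files), default=0) / 1000.0

while spark.streams.active:
    if time.time() - latest_arrival_seconds(SOURCE_PATH) > IDLE_TIMEOUT_SECONDS:
        # No new data for the predefined period: stop every active stream,
        # which lets the job finish and the job cluster terminate.
        for query in spark.streams.active:
            query.stop()
        break
    time.sleep(POLL_INTERVAL_SECONDS)
```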

Data storage and notification

To send Slack notifications, the following items need to be set up:

  • A Slack workspace
  • One channel for each user
  • A Slack App with Incoming Webhooks configured (a minimal posting helper is sketched below)
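
Once a webhook URL exists, posting to the channel is a single HTTP call. A minimal helper using the requests library; the message text is illustrative.

```python
import requests

def post_to_slack(webhook_url: str, text: str) -> bool:
    """Send a message to a Slack channel via its Incoming Webhook URL."""
    response = requests.post(webhook_url, json={"text": text}, timeout=10)
    return response.ok

# Example usage with a placeholder webhook URL:
# post_to_slack("https://hooks.slack.com/services/...", "Pipeline started")
```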

Create member profile

  • Create member data using the webhook URL, user ID and nameOrig
  • Join the member data with the transaction records (see the sketch below)
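
A sketch of that member profile and join, assuming the scored transactions live in a DataFrame named transactions_df keyed on nameOrig. The account IDs and webhook URLs below are placeholders, not real values.

```python
from pyspark.sql import Row

# Hypothetical member profiles: one Slack webhook per demo user, keyed on nameOrig
members_df = spark.createDataFrame([
    Row(nameOrig="C0000000001", userid="user01", webhook_url="https://hooks.slack.com/services/..."),
    Row(nameOrig="C0000000002", userid="user02", webhook_url="https://hooks.slack.com/services/..."),
    Row(nameOrig="C0000000003", userid="user03", webhook_url="https://hooks.slack.com/services/..."),
])

# Attach each transaction record to its account owner's notification channel
transactions_with_members = transactions_df.join(members_df, on="nameOrig", how="left")
```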


[Image: Real-time Slack notification]


Apply the registered Slack notification function slack_notification

  • The function has three arguments, i.e. prediction, userid and webhook_url
  • It evaluates each prediction output and sends a notification to the account owner (a sketch of such a function follows this list)
  • For the purposes of the demo, only three user accounts were used: user01, user02 and user03
  • Note: in a real production scenario, there will be far more users and, most likely, far fewer fraudulent transactions (output '1')
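
The exact implementation is in the repository; a minimal sketch of a function with this signature, registered as a PySpark UDF and applied to the joined DataFrame from the previous step, could look like the following. The message text and DataFrame names are assumptions.

```python
import requests
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

def slack_notification(prediction, userid, webhook_url):
    """Notify the account owner via Slack when a transaction is flagged as fraud."""
    if prediction == 1 and webhook_url:
        message = f"Alert for {userid}: a potentially fraudulent transaction was detected."
        response = requests.post(webhook_url, json={"text": message}, timeout=10)
        return "sent" if response.ok else "failed"
    return "not required"

slack_notification_udf = udf(slack_notification, StringType())

# Evaluate every scored record and record whether a notification went out
final_df = transactions_with_members.withColumn(
    "notification_status",
    slack_notification_udf(col("prediction"), col("userid"), col("webhook_url")),
)
```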

Finally, the notification status is marked on the final table, as follows:

[Image: Model output and notification status]



