Production architecture of a streaming analytics data pipeline
A structured streaming analytics workflow can consume significant resources. Efficient and reliable orchestration helps reduce cluster expense while ensuring that ETL objectives are met, the machine learning model is applied and, most importantly, model outputs are communicated to all stakeholders in a timely manner.
This article describes a scenario where streaming data arrives intermittently in a storage container; the pipeline processes the data, applies a machine learning model and, depending on the model output, sends notifications. This intermittent nature requires spinning up the cluster as needed and terminating it while idle (no streaming data). The data pipeline, shown above, addresses this scenario using Data Factory as the orchestration framework and Slack as the real-time notification service.
Detailed code is shared on GitHub: https://github.com/uddin007/credit-fraud-streaming-analytics
A brief summary of the production architecture and its components follows. Azure Data Factory (ADF) is used as the orchestration framework.
Azure BlobEventsTrigger
ADF pipeline
Databricks Job termination
Data storage and notification
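The Databricks job-termination step listed above keeps the cluster from idling between streaming batches. A minimal sketch of that step, using the Databricks REST API 2.0 `clusters/delete` endpoint, is shown below; the workspace URL, cluster ID, and token are placeholders, not values from the repository:

```python
import json
import urllib.request

# Placeholder workspace URL -- substitute your own Azure Databricks host.
DATABRICKS_HOST = "https://adb-1234567890.12.azuredatabricks.net"


def build_terminate_request(cluster_id: str) -> tuple:
    """Build the URL and JSON body for a cluster-termination call.

    The Databricks REST API 2.0 terminates (but does not permanently
    remove) a cluster via POST /api/2.0/clusters/delete.
    """
    url = f"{DATABRICKS_HOST}/api/2.0/clusters/delete"
    payload = json.dumps({"cluster_id": cluster_id}).encode("utf-8")
    return url, payload


def terminate_cluster(cluster_id: str, token: str) -> None:
    """Send the termination request to the workspace (requires network access)."""
    url, payload = build_terminate_request(cluster_id)
    req = urllib.request.Request(
        url,
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()
```

In the pipeline described here, ADF would invoke this termination after the streaming batch finishes, so the cluster only runs while data is present.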
To send Slack notifications, the following items need to be set up:
Create member profile
Apply the registered Slack notification function slack_notification
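The repository registers its own slack_notification function; a minimal stand-in using Slack's incoming-webhook API might look like the sketch below. The webhook URL is an assumption: you generate it when creating an incoming webhook for your Slack workspace.

```python
import json
import urllib.request


def build_slack_payload(message: str) -> dict:
    """Incoming webhooks expect a JSON body with a 'text' field."""
    return {"text": message}


def slack_notification(webhook_url: str, message: str) -> None:
    """Post a message to a Slack channel via an incoming webhook (requires network access)."""
    payload = json.dumps(build_slack_payload(message)).encode("utf-8")
    req = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()
```

In the pipeline, this function would be called per model output, e.g. `slack_notification(WEBHOOK_URL, "Potential fraud detected in batch 42")`, with the webhook URL held in a secret scope rather than hard-coded.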
Finally, the notification status is marked in the final table, as follows:
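One way to sketch that status-marking step is to generate the SQL that updates the final table after a notification is sent; the table and column names below (`fraud_predictions`, `notification_status`, `batch_id`) are illustrative, not taken from the repository:

```python
def build_status_update_sql(table: str, batch_id: int, status: str = "sent") -> str:
    """Build the SQL that marks the notification status for one processed batch.

    Table and column names are illustrative; adapt them to the actual
    table produced by the pipeline.
    """
    return (
        f"UPDATE {table} "
        f"SET notification_status = '{status}' "
        f"WHERE batch_id = {batch_id}"
    )
```

On Databricks this would typically run against a Delta table, e.g. `spark.sql(build_status_update_sql("fraud_predictions", 42))`, so downstream consumers can see which model outputs have already been communicated.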