Data Pipeline Monitoring
Rajaraman Sathyamurthy
Associate Director & Senior Architect, Data Architecture
There are many reasons a data pipeline could break. Most of the time it is due to issues at the data source: the servers we ingest data from could be down, or there could be connectivity issues, authentication issues, and so on. When this happens, data goes missing and dashboards are unable to show correct or complete data.
Users of the dashboards or the application owners would notice this and contact us for remediation. But do you really want your customers to escalate an issue before you troubleshoot and fix it? That is reactive. So how can you monitor data pipelines proactively?
In ELK, you can use watchers to monitor the indices at the required frequency. The monitoring frequency depends on how often the data is ingested or refreshed and can differ for each index. When you have a dedicated monitoring cluster (based on Elastic) to monitor all ELK clusters in the environment, you can enable cross-cluster search (CCS). Cross-cluster replication (CCR) is also an option, but it would require a larger monitoring cluster (more storage). The watcher can be configured to send alerts to the respective application stakeholders, support team, and others, who can proactively take corrective action without having to wait for end users to report issues.
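As a minimal sketch of this idea (the cluster URL, credentials, the remote cluster alias prod-cluster, the index pattern app-logs-*, and the time windows below are all assumptions, not values from our environment), the watcher below runs every 15 minutes on the monitoring cluster, searches the remote cluster via CCS, and fires when no documents have arrived in the last 30 minutes:

```python
import requests

# Assumed endpoint and names, for illustration only.
MONITORING_CLUSTER = "https://monitoring-cluster.example.com:9200"
WATCH_ID = "app-logs-freshness"

# Watch definition: every 15 minutes, search the remote cluster through
# cross-cluster search (alias "prod-cluster") and alert if no documents
# landed in app-logs-* during the last 30 minutes.
watch = {
    "trigger": {"schedule": {"interval": "15m"}},
    "input": {
        "search": {
            "request": {
                "indices": ["prod-cluster:app-logs-*"],  # CCS syntax: <cluster-alias>:<index-pattern>
                "body": {
                    "size": 0,
                    "query": {"range": {"@timestamp": {"gte": "now-30m"}}},
                },
            }
        }
    },
    # Fire only when the recent document count drops to zero.
    "condition": {"compare": {"ctx.payload.hits.total": {"lte": 0}}},
    "actions": {
        "log_missing_data": {
            "logging": {
                "text": "No documents ingested into app-logs-* in the last 30 minutes."
            }
        }
    },
}

# Register the watch via the Watcher API on the monitoring cluster.
resp = requests.put(
    f"{MONITORING_CLUSTER}/_watcher/watch/{WATCH_ID}",
    json=watch,
    auth=("elastic", "changeme"),  # replace with real credentials or an API key
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```

The interval and the "now-30m" freshness window would be tuned per index to match its ingestion or refresh frequency, as noted above.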
The diagram shows integration with Slack for outbound alert notifications, but by using the respective webhook you can integrate with ServiceNow, MS Teams, and other tools according to your needs.
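To route the alert to a chat or ticketing tool, the logging action in the sketch above can be swapped for a watcher webhook action. The snippet below is an assumed example posting to a Slack incoming webhook (the webhook path is a placeholder); the same structure can point at an MS Teams or ServiceNow endpoint with the appropriate host, path, and payload:

```python
# Webhook action posting to a Slack incoming webhook (path is a placeholder).
# Change host/path/body to target MS Teams, ServiceNow, or any HTTP endpoint.
slack_webhook_action = {
    "notify_slack": {
        "webhook": {
            "scheme": "https",
            "host": "hooks.slack.com",
            "port": 443,
            "method": "post",
            "path": "/services/T0000/B0000/XXXXXXXX",  # placeholder webhook path
            "headers": {"Content-Type": "application/json"},
            "body": '{"text": "Data pipeline alert: no documents in app-logs-* for 30 minutes."}',
        }
    }
}

# Merge into the watch definition before the PUT call shown earlier.
watch["actions"].update(slack_webhook_action)
```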
Team: Parthiban P; Liju Thomas; Rajesh Mehra; Rajaraman Sathyamurthy