Monitoring and Alerting
In previous sections, I used an analogy to the human body, comparing Kafka to the heart and the Portal to the brain. We know these usually function well, but not always. Therefore, we need to constantly monitor them, check their performance and health, detect any changes, and respond to them in time. I would compare this functionality to the nervous system, which monitors the entire body and responds to changes to maintain balance and protect against harm.
Prometheus & Grafana: Metrics Collection, Monitoring and Visualisation
Prometheus is the cornerstone of our monitoring system. This open-source application is widely used across organisations for monitoring and alerting purposes. Because of its popularity, Prometheus is stable and well integrated with various platforms, and there is strong community support for any issues that may arise.
We’re currently scraping more than a hundred metrics from most of the components of our platform, which run either on virtual machines or in Kubernetes, and we collect many different types of metrics.
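To make this more concrete, below is a minimal sketch of how a custom platform component could expose metrics for Prometheus to scrape, using the prometheus_client library for Python. The component name, port and metric names are hypothetical and used purely for illustration.

    # Minimal sketch: a hypothetical component exposing metrics over HTTP
    # so that Prometheus can scrape them.
    import random
    import time

    from prometheus_client import Counter, Gauge, start_http_server

    # Hypothetical metrics; real components expose their own application
    # and infrastructure metrics.
    REQUESTS_TOTAL = Counter(
        "edh_portal_requests_total",
        "Total number of requests handled by the component",
    )
    ACTIVE_SESSIONS = Gauge(
        "edh_portal_active_sessions",
        "Number of currently active user sessions",
    )

    if __name__ == "__main__":
        # Prometheus scrapes http://<host>:8000/metrics at its configured interval.
        start_http_server(8000)
        while True:
            REQUESTS_TOTAL.inc()
            ACTIVE_SESSIONS.set(random.randint(0, 50))  # dummy value for the sketch
            time.sleep(5)

Prometheus then only needs a scrape target pointing at that endpoint, and the metric values become available for rules and dashboards.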
In Prometheus, rules define the conditions under which alerts are triggered and subsequently sent to Microsoft Teams groups via Alertmanager.
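To illustrate that flow (this is not our actual configuration, where Alertmanager routes alerts to Teams directly), the sketch below polls the Prometheus HTTP API for firing alerts and posts them to a Microsoft Teams incoming webhook; both URLs are placeholders.

    # Illustration only: fetch firing alerts from the Prometheus HTTP API
    # and forward them to a Microsoft Teams incoming webhook.
    import requests

    PROMETHEUS_URL = "http://prometheus.example.internal:9090"    # placeholder
    TEAMS_WEBHOOK_URL = "https://example.webhook.office.com/..."  # placeholder

    def fetch_firing_alerts():
        resp = requests.get(f"{PROMETHEUS_URL}/api/v1/alerts", timeout=10)
        resp.raise_for_status()
        alerts = resp.json()["data"]["alerts"]
        return [a for a in alerts if a.get("state") == "firing"]

    def notify_teams(alert):
        # Teams incoming webhooks accept a simple JSON payload with a "text" field.
        name = alert["labels"].get("alertname", "unknown")
        summary = alert["annotations"].get("summary", "")
        payload = {"text": f"Alert firing: {name} - {summary}"}
        requests.post(TEAMS_WEBHOOK_URL, json=payload, timeout=10)

    if __name__ == "__main__":
        for alert in fetch_firing_alerts():
            notify_teams(alert)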
It’s difficult to analyse metrics directly in Prometheus, as they are raw numerical data, so we visualise them in Grafana, where users have access to the respective dashboards and charts. Besides its dashboarding capabilities, Grafana can also send alerts, and in our case we use it for that purpose in Monitor Group.
Kibana & Elasticsearch
While Grafana and Prometheus are excellent for detecting anomalies in the platform, they cannot tell us what exactly is happening inside a particular application. This is where Kibana and Elasticsearch come to the rescue, giving us centralised access to logs instead of having to go to each individual server and analyse files one by one.
Everything starts with Filebeat, a lightweight log collector that ships logs to Elasticsearch. This search and analytics engine stores the log data and is optimised for full-text search, allowing quick retrieval of information and real-time analytics. Finally, users can analyse this data via Kibana dashboards, charts and graphs.
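When a dashboard is not enough, the same data can also be queried programmatically. The sketch below assumes the elasticsearch Python client (8.x) and the default Filebeat index pattern, and pulls the most recent error entries from the last 15 minutes; the host name is a placeholder.

    # Sketch: querying Filebeat-shipped logs stored in Elasticsearch,
    # assuming the elasticsearch Python client 8.x and the default
    # "filebeat-*" index pattern.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://elasticsearch.example.internal:9200")  # placeholder

    response = es.search(
        index="filebeat-*",
        query={
            "bool": {
                "must": [
                    {"match": {"message": "ERROR"}},
                    {"range": {"@timestamp": {"gte": "now-15m"}}},
                ]
            }
        },
        sort=[{"@timestamp": {"order": "desc"}}],
        size=20,
    )

    for hit in response["hits"]["hits"]:
        source = hit["_source"]
        print(source.get("@timestamp"), source.get("message"))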
Audit Functionality
As mentioned earlier, the platform hosts thousands of Kafka topics and associated objects (e.g. certificates or ACLs). We need to make sure that topics are actively used so that resources are optimally utilised. Therefore, we audit topics against a set of conditions.
A daily process checks these conditions and sends an email to the respective topic owners with a request to take a specific action.
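As a rough sketch of how such a daily audit job could look (an illustration of the idea, not our production code), the Python snippet below lists topics with the Kafka AdminClient, flags the ones that fail a hypothetical activity check, and emails their owners; the activity check, owner lookup, hosts and addresses are all placeholders.

    # Sketch of a daily topic-audit job: list topics, flag the ones that fail
    # an activity check, and email the owners.
    import smtplib
    from email.message import EmailMessage

    from confluent_kafka.admin import AdminClient

    BOOTSTRAP_SERVERS = "kafka.example.internal:9092"  # placeholder
    SMTP_HOST = "smtp.example.internal"                # placeholder

    def is_topic_active(topic_name):
        # Placeholder: a real audit could inspect consumer group offsets,
        # broker metrics, or the timestamp of the last produced message.
        return False

    def owner_email_for(topic_name):
        # Placeholder: owners could be looked up in a metadata store or the Portal.
        return "owner@example.com"

    def audit_topics():
        admin = AdminClient({"bootstrap.servers": BOOTSTRAP_SERVERS})
        metadata = admin.list_topics(timeout=10)
        inactive = [name for name in metadata.topics
                    if not name.startswith("__")       # skip internal topics
                    and not is_topic_active(name)]

        with smtplib.SMTP(SMTP_HOST) as smtp:
            for topic in inactive:
                msg = EmailMessage()
                msg["Subject"] = f"Action required: Kafka topic '{topic}' looks unused"
                msg["From"] = "edh-audit@example.com"  # placeholder
                msg["To"] = owner_email_for(topic)
                msg.set_content(
                    f"The topic '{topic}' appears to be inactive. "
                    "Please confirm whether it is still needed or request its removal."
                )
                smtp.send_message(msg)

    if __name__ == "__main__":
        audit_topics()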
Summary
In this article, I aimed to give you at least a fraction of the information about the EDH platform and its components. Detailing each aspect thoroughly would probably require an entire book chapter!
However, I hope that after reading this, you have an idea of the purpose of this platform, a high-level overview of its main components, and the technology behind it.
Obviously, we haven’t stopped there, and we continue to introduce new enhancements to deliver the best possible product for Bayer engineers. Apart from the regular OS and software upgrades, and the KRaft migration mentioned earlier, we are adding more functionality to the EDH.
One worth mentioning is EDH Wire, which integrates the platform with various data sources and sinks, so users don’t have to develop their own producers and consumers.
Another tool, currently in development, is the Stream Transformer. It offers advanced transformation and integration capabilities (e.g. merging, joining, routing). Built on Apache Flink, it inherits all of its features. The jobs are deployed on a Kubernetes cluster, ensuring scalability, fault tolerance and easy maintainability.
But that’s not all: we have other ideas on our roadmap, which will be described in the next article!