The Flaw of Averages When Collecting System Metrics
The picture above tells the whole story: this is what happened to us when we were collecting only the average CPU usage. We use the https://github.com/sensu-plugins/sensu-plugins-cpu-checks plugin to collect system metrics, but it did not help us figure out the issue. Our application crashed from time to time while the average CPU usage seemed fine.
Later, once I started writing a custom script (plugin) to monitor every process that could contribute to high CPU usage, we realised that our Node application was the one consuming the CPU.
Here are the server stats, which seem fine. Keep in mind, though, that you will be misled if you rely on the overall CPU average.
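To make the point concrete, here is a small sketch with made-up numbers. The per-core readings are purely hypothetical and are not taken from our servers. They assume an 8-core machine where one busy process pins a single core, which is typical of a Node application because it runs its JavaScript on one thread.

```python
def summarize(per_core):
    """Return (average, peak) for a list of per-core CPU percentages."""
    average = sum(per_core) / len(per_core)
    peak = max(per_core)
    return average, peak


# Hypothetical readings: one core saturated, the other seven nearly idle.
per_core = [100.0, 4.0, 3.0, 5.0, 2.0, 4.0, 3.0, 3.0]

average, peak = summarize(per_core)
print("average CPU: {:.1f}%".format(average))  # 15.5%
print("busiest core: {:.1f}%".format(peak))    # 100.0%
```

An alert set at, say, 80% average CPU would never fire here, even though the process that matters has no headroom left.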
Our plugin did the job and identified a classic case of the flaw of averages. You can find it on GitHub. Here are the stats from our plugin during the application crash.
It is very simple to integrate: all you need to do is create a simple check configuration with the name of the process to monitor.
{
  "checks": {
    "myapp-stats": {
      "type": "metric",
      "command": "/opt/sensu/embedded/bin/metrics_per_process.py -p /home/user/my-app.js",
      "interval": 60,
      "subscribers": [
        "my-app"
      ],
      "handler": "librato"
    }
  }
}
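If you are curious what a per-process metric check can look like, below is a minimal sketch. It is not the actual metrics_per_process.py plugin. The function names, the "myapp.cpu" metric name and the Graphite-style "name value timestamp" output line are all assumptions made for illustration. It reads the process list from ps and sums the CPU of every process whose command line contains a given pattern.

```python
import os
import subprocess
import time


def parse_ps(text):
    """Parse 'PID %CPU ARGS' lines into (pid, cpu_percent, command) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        pid, cpu, args = parts
        try:
            rows.append((int(pid), float(cpu), args))
        except ValueError:
            continue  # not a data row, skip it
    return rows


def cpu_for(rows, pattern):
    """Sum the CPU percentage of every process whose command contains pattern."""
    return sum(cpu for _pid, cpu, args in rows if pattern in args)


def graphite_line(name, value, timestamp):
    """Format one metric as 'name value timestamp' (assumed output format)."""
    return "{} {:.1f} {}".format(name, value, timestamp)


def read_ps():
    """Read live process data, leaving out this script's own process.

    Caveat: on Linux, ps reports %CPU averaged over the process lifetime,
    so a real plugin would rather sample CPU time twice and take the delta.
    """
    out = subprocess.check_output(
        ["ps", "-A", "-o", "pid=", "-o", "pcpu=", "-o", "args="], text=True
    )
    return [row for row in parse_ps(out) if row[0] != os.getpid()]


# Demo on canned sample data (hypothetical values, not real measurements).
sample = (
    "  101  3.0 /usr/bin/filebeat -c /etc/filebeat/filebeat.yml\n"
    "  202 97.5 node /home/user/my-app.js\n"
)
usage = cpu_for(parse_ps(sample), "/home/user/my-app.js")
print(graphite_line("myapp.cpu", usage, int(time.time())))
```

To run it against a live system, replace parse_ps(sample) with read_ps(). The pattern plays the same role as the -p argument in the check command above.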
In the next article I will share how to deal with such high CPU usage and how to automatically terminate a process that is not essential. For example, filebeat is not more important than the server itself, but what if filebeat is consuming all the resources?