Monitoring Tools Tailored to the Human

Once upon a time at a client, an example arose of a well-designed technical solution falling partially short of achieving its desired outcome due to an unanticipated people factor. Adjusting the technical approach to account for this factor helped the situation. What follows is inspired by this customer experience.

In order to move to a more proactive approach to incident handling (for servers), a monitoring tool was designed and implemented to alert technicians not only when things go wrong, but also when servers become degraded, allowing admins to get out in front of, and (ideally) resolve, issues before they become, well…issues.

With this sort of tool, however, spurious alerts must be managed, or else technicians come to view the alerts as “noise” and ignore them altogether. For this reason, tuning the tool is an important phase of the implementation, but there must also be a process change for routine work on the servers: placing the devices in maintenance mode. If a device is rebooted without first suppressing its monitoring, spurious alerts may result; placing the device into maintenance mode beforehand prevents them.
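To make the idea concrete, here is a minimal sketch of how a maintenance window can suppress alerts. It is not how any particular monitoring product implements it; the names `MaintenanceWindow` and `is_suppressed` are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MaintenanceWindow:
    """Hypothetical record of a planned maintenance period for one host."""
    host: str
    start: datetime
    end: datetime


def is_suppressed(host, alert_time, windows):
    """Return True if an alert for `host` falls inside any of its windows.

    The window is treated as half-open: start <= alert_time < end.
    """
    return any(
        w.host == host and w.start <= alert_time < w.end
        for w in windows
    )
```

An alert raised by a reboot inside the window is dropped, while the same alert a minute after the window closes still reaches the technicians.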

So…the tool was implemented and tuned, and procedures were modified to require technicians to access the monitoring console and place devices in maintenance mode prior to rebooting them. And yet, they did not do so. For some reason, and I haven’t yet divined why, some folks just feel it’s easier to close out alerts the next day than it is to place the devices in maintenance mode (I think it may be a path-of-least-resistance thing: logging onto the console is too much to ask). Technicians were patching and rebooting servers without maintenance mode, generating a host of spurious alerts. Reports contained inaccurate data, unnecessary tickets were created, and time was wasted researching meaningless alerts.

A few of my teammates and I were chatting about this over lunch one day, and I put the idea out there about eliminating the “extra” step of accessing the monitoring console by providing a means for the server technicians to place the devices in maintenance mode right from the device itself – a script they could run right from the desktop. The result of this collaborative effort may be found here – a method for placing devices into maintenance mode remotely.
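For readers who want a feel for what such a desktop script might look like, here is a rough sketch. It is not the script our team produced, and it assumes the monitoring tool accepts maintenance requests over HTTP. The URL, the JSON field names, and the functions `build_request` and `request_maintenance` are all hypothetical and would need to be swapped for whatever your tool actually provides.

```python
import json
import socket
import urllib.request
from datetime import datetime, timedelta, timezone

# Hypothetical endpoint; replace with your monitoring tool's real interface.
MONITORING_URL = "https://monitoring.example.internal/api/maintenance"


def build_request(minutes, reason, host=None, now=None):
    """Build the maintenance request for this server (or a named host)."""
    if minutes <= 0:
        raise ValueError("maintenance duration must be positive")
    host = host or socket.gethostname()  # default: the machine running the script
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(minutes=minutes)
    # Field names are assumptions, not a real product's schema.
    return {
        "host": host,
        "start": start.isoformat(),
        "end": end.isoformat(),
        "reason": reason,
    }


def request_maintenance(payload, url=MONITORING_URL):
    """POST the request to the (assumed) monitoring endpoint; return HTTP status."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

A technician about to patch and reboot would call something like `request_maintenance(build_request(30, "monthly patching"))` from the server itself, with no trip to the monitoring console. Defaulting the host to the local machine name is the point of the design: the fewer things the technician has to type, the more likely the step gets done.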

For Operations Managers, and for Event Management process owners, the reduction of those (say it with me) spurious alerts is a critical success factor. As engineers, we may design our tools and processes in such a manner that we believe we have provided everything needed to realize our goals, but do not forget the human factor!
