One of the main themes of 2017 was Web Operations.
I don't mean this was the year I started with Web Operations; it was the year the investment paid off.
The knowledge gained through a sustained investment in monitoring and alerting can be distilled into a few simple principles and practices.
- The Three Fs of Event Log Monitoring
- Incident Causation Principles
- Alerting Principles
- Monitor Selection Principles
I wrote about these in Web Operations Dashboards, Monitoring, and Alerting, and I have also written about these techniques in my blog posts on monitoring, but I wanted to highlight the benefits that we got by actually applying them. To give some context, these have all been used in anger on a SaaS product used in 20 countries and 15 languages, and hosted out of multiple data centres.
- Fixing event logs (making it so you can see real exceptions, by reducing noise) means you can solve real customer problems. Your live environment has great insights for you if you work at it. We used the Three Fs to reduce error-level logs until we were able to investigate each class of error.
- Understanding how to work back to a root cause means you can fix the problem. It is tempting to restart a machine to fix a problem, and then allow the investigation into the incident to fall aside because it isn't urgent any more (as we know, the only way to stop everything being urgent is to understand what is important).
- When something is wrong, an alarm must sound, but ideally it shouldn't sound when there isn't a problem. This is a fundamental tension when it comes to Web Operations.
- You have to fine-tune your monitoring strategy, so that when an alarm sounds you know you have to react. The local shopping mall tests its fire alarms every morning, and now nobody pays any attention to them. When the siren goes off, it should get your attention, and that means making it go off at the right times (as defined by the alerting principles).
- You need to be selective with what you put on your dashboard, usually by picking leading indicators of problems. Although it is tempting to have lots of dashboards shown in rotation, a better idea is to have a fixed dashboard. Anomalies really stand out if you leave the same dashboard in the same place.
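As a rough illustration of the first point above, here is a minimal Python sketch of grouping error-level log entries by class, so that the noisiest classes can be dealt with first and the real exceptions become visible. The log entries, their tuple layout, and the `error_classes` helper are all hypothetical, made up for this example; it is not the Three Fs method itself, just one way of seeing where the noise comes from.

```python
from collections import Counter

# Hypothetical log entries as (level, exception class, message).
# A real event log would be read from your logging tool instead.
LOG_ENTRIES = [
    ("ERROR", "TimeoutError", "Payment gateway timed out"),
    ("ERROR", "KeyError", "Missing translation key 'checkout.title'"),
    ("INFO", "", "User signed in"),
    ("ERROR", "KeyError", "Missing translation key 'basket.empty'"),
    ("ERROR", "KeyError", "Missing translation key 'checkout.title'"),
    ("WARN", "", "Slow query"),
]


def error_classes(entries):
    """Count error-level entries per exception class, noisiest first."""
    counts = Counter(cls for level, cls, _ in entries if level == "ERROR")
    return counts.most_common()


for cls, count in error_classes(LOG_ENTRIES):
    print(f"{cls}: {count}")
```

Working down a list like this, one class at a time, is one way to get error-level logs to the point where each remaining class can be investigated.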
And finally, while Web Operations can be a technical exercise, an incredibly powerful technique is to add meters for business metrics into the same tool you use for the technical stuff. If you can see sales, leads, and other business goals alongside the technical information, you have a chance to correlate it.
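To make the idea of correlating business and technical metrics a little more concrete, here is a small Python sketch that computes a Pearson correlation between two series sampled over the same time windows. The numbers, the series names, and the `pearson` helper are invented for illustration; a monitoring tool would normally supply the data and may well offer this kind of comparison itself.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical samples taken over the same six time windows.
response_ms = [120, 130, 125, 400, 650, 140]  # technical metric
sales = [42, 40, 41, 18, 9, 39]               # business metric

r = pearson(response_ms, sales)
print(f"correlation between response time and sales: {r:.2f}")
```

With this made-up data the result is strongly negative: sales fall in the windows where response time spikes. A figure like that does not prove the cause, but it tells you where to look.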
Have a great 2018, and if you'd like to read more you can grab Web Operations Dashboards, Monitoring, and Alerting on Amazon.