Best Practices for Monitoring using APM
I have often thought about publishing an article on best practices for possibly the most exciting aspect of DevOps: monitoring.
OK, so possibly it's not seen as being as exciting as solution design, coding or the post go-live party, but it is an essential facet of any solution.
Historically, monitoring was the responsibility of the Operations or Infrastructure teams, but with the rise of DevOps this has certainly changed. The ability to design and build appropriate monitoring for systems is a skill that the modern Software Engineer should have in their bag of tricks.
Another significant change over recent years has been the rise of Application Performance Monitoring (APM) tools. APM allows us to dive much deeper into how our systems are performing. Personally, I have been lucky enough to work in organisations that have invested heavily in AppDynamics, which is a cutting-edge APM tool. However, some of the best practices I would like to discuss here are APM tool agnostic.
So what do I see as being some best practices?
1. Build a library of Health Checks - It's always a great idea to have a good base library of common health checks that can be reused. Every system should have health checks for CPU, Memory, I/O etc. so why re-invent the wheel? However the flip side here is that every system has a different run-time footprint and common health checks can be too generic. So it's best to use your library as a starting point for creating health checks and then tweak them to suit the target system.
2. Don't go too skinny on your health checks - It's quite common for a monitoring solution to work by simply polling a status page periodically. Whilst this is better than no monitoring, it can be misleading and give a false impression of health: the status page can respond normally while a core service behind it is failing. A well fleshed out monitoring solution will check that all core services of our solution are working as expected and notify us if that's not the case.
3. Monitor dependencies - A big benefit of using an APM tool is that we gain visibility into how our systems interact with their underlying resources (Databases, Web Services, Queues etc.). Using metrics relating to the performance of these resources we can create health checks and alerts. If your system interfaces with 3rd party systems, this can be a great way to ensure that service level agreements are being kept and to identify any performance issues.
4. Monitor your monitoring system - Nowadays the monitoring solution plays a pivotal role within an organisation. Some APM tools such as AppDynamics go well past basic monitoring / alerting and can perform an orchestration role. Using these tools we can perform actions such as running a remediation script on a server or auto-scaling an AWS environment when a health check changes status. With the reliance that we now have on our monitoring solutions, the important question is: who monitors the monitoring solution? The usual answer is to have an ancillary monitoring solution in place that monitors the primary monitoring platform (and vice-versa). The two systems should follow different maintenance cycles to ensure they are never both unavailable at the same time.
5. Learn from every outage - A well designed, comprehensive monitoring solution should always alert us when an outage occurs in our systems (ideally before any customers let us know). Unfortunately, this may not always be the case. When an outage is missed, it's important to review why it was not detected and then add / amend health checks so that it would be picked up in the future.
6. Monitor load - Too often monitoring focuses purely on the response time and error rates of services. While these are critical metrics, it's also important to monitor load and to identify anomalies. If load falls significantly below expected levels, this may indicate that customers are unable to reach our services. On the flip side, if load is significantly higher than the baseline, this may indicate that a malicious attack on the system is occurring.
7. Monitor the end user experience - For customer-facing websites it's important to have an end-to-end monitoring solution that has visibility of not only the performance of back-end services/hardware but also of the end user experience for our customers. A number of APM tools now allow for the collection of End User Monitoring (EUM) metrics from end users and / or synthetic traffic. Synthetics allows us to create artificial traffic from different locations globally and ensure that our services are working as expected.
8. Create baselines - The ability to create baselines for metrics such as load, response time and error rates is an important feature for effective monitoring. Using baseline metrics we can measure the performance of a system against historical data. Baselines can recognise recurring events such as nightly restarts or application release / patching windows. However, following any significant change to a system (e.g. a change to the underlying hardware), it is important to determine whether the existing baseline is still valid or whether a new one needs to be created.
9. Keep it simple... but not too simple - Health checks should essentially be quite simple. For example, a very simple health check may trigger if the error rate of a system is greater than the baseline. However, if a system has a baseline error rate of 0%, one failed transaction in a million could trigger an alert for a possibly benign issue. In this scenario we generally want slightly more complex logic that uses a combination of load, error rate percentage and baseline deviation to determine whether our system has entered an unhealthy state.
10. Constantly refine - Systems evolve over time and the performance of a system when it initially went live can look dramatically different 6 months later. It's important to get into the practice of periodically reviewing health rules to ensure that they are still appropriate and adequate.
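As a rough illustration of practices 6 and 8, here is a minimal Python sketch that compares current load against a baseline built from historical samples. It is not how any particular APM tool computes baselines; the function name classify_load, the three-standard-deviation threshold and the sample numbers are all assumptions made for the example.

```python
from statistics import mean, pstdev


def classify_load(current, history, max_deviations=3.0):
    """Compare current load with a baseline derived from history.

    `history` holds calls-per-minute samples from comparable periods
    (for example the same hour on previous days). The threshold of
    3 standard deviations is an assumed default, not a recommendation.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    baseline = mean(history)
    allowed = max_deviations * pstdev(history)
    if current < baseline - allowed:
        return "low"   # customers may be unable to reach the service
    if current > baseline + allowed:
        return "high"  # unexpected surge, possibly malicious traffic
    return "normal"


# Hypothetical calls-per-minute samples for the same hour on five days.
history = [1000, 1100, 900, 1050, 950]
print(classify_load(1020, history))
print(classify_load(200, history))
print(classify_load(5000, history))
```

Following practice 8, the history list would need to be rebuilt after any significant change to the system, since the old samples no longer describe normal behaviour.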
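The composite logic described in practice 9 could look something like the Python sketch below. It only alerts when there is enough load to judge, the error rate is above a minimum percentage, and the error rate deviates clearly from the baseline. The names WindowStats and is_unhealthy and every threshold value are hypothetical and would need tuning for a real system.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Metrics for one evaluation window (hypothetical shape)."""
    calls: int   # load: transactions seen in the window
    errors: int  # failed transactions in the window


def is_unhealthy(window, baseline_error_rate, min_calls=1000,
                 min_error_rate=0.01, deviation_factor=3.0):
    """Return True only if load, error rate and baseline deviation agree.

    All thresholds are assumed defaults chosen for illustration.
    """
    # Too little load to draw a conclusion about error rate.
    if window.calls < min_calls:
        return False
    error_rate = window.errors / window.calls
    # Ignore a handful of failures, even against a 0% baseline.
    if error_rate < min_error_rate:
        return False
    # Require a clear deviation from the historical baseline.
    return error_rate > baseline_error_rate * deviation_factor


# One failure in a million against a 0% baseline: no alert.
print(is_unhealthy(WindowStats(calls=1_000_000, errors=1), 0.0))
# A 5% error rate against a 0% baseline: alert.
print(is_unhealthy(WindowStats(calls=1_000_000, errors=50_000), 0.0))
```

Note that this check deliberately stays silent when load is very low; a drop in load is better caught by a separate load check, as in practice 6.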
Thoughts?
Experienced DevOps technologist
7 years ago: AppDynamics is good but seem to be having trouble keeping up with new micro service and containerised envs. Any other tools you have looked at?
Account Executive @ Elastic | Strategic Account Management
7 years ago: Agree! Great article! Everyone's doing "DevOps" Are they doing it right if they don't have DevOps monitoring providing a continuous feedback loop?
Managing Director Asia & Pacific at LexisNexis
7 years ago: Great all round article, well written. Glad we have Appdynamics