Use your Scheduled Downtime (pt 1)
Part 1 (part 2 is here)
Service Level Agreement
There are many things you can never avoid in a real-world production environment. One of them is downtime (an outage).
Even if you run a high-availability architecture, it is only highly available, not continuously available. An outage is something you should expect and be ready for.
When you provide services, you have a set of Service Level Agreements (SLAs). In general, these are contracts between you and the users of the system or service. In effect, they state the obligations of the parties: what the user is paying for in terms of service quality, and what the service provider is obliged to do to ensure the required level of service.
Besides, SLAs exist not only between you and your users; they should (I hope) also exist between your services. A few words about this later.
In business terms, one of the most critical SLAs is the one for uptime (or availability). This is the percentage of time in a year (quarter, month, etc.) during which the service is guaranteed to be available to the user.
In some cases uptime and availability are not synonymous. For example, a service can be up but, for a number of reasons, not available to the user or to its target audience. It is most reasonable to define the availability or uptime of a service (or of a system as a whole) in terms of whether it performs its functional tasks. The user and the operator must therefore share an unambiguous understanding of what these terms mean. For simplicity, the rest of this text uses the two terms as synonyms.
Downtime is the percentage of time that a service may be unavailable. In the vast majority of cases, this is an upper limit, not a value that has to be used up. It is usually read as: "The service will be unavailable for at most this long, but we will do everything possible to keep downtime lower."
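To make the relation between an uptime percentage and the downtime it allows concrete, here is a minimal sketch. The function name `allowed_downtime` and the sample SLA values are assumptions chosen for illustration; they do not come from any particular contract.

```python
from datetime import timedelta


def allowed_downtime(uptime_percent: float, period: timedelta) -> timedelta:
    """Return the maximum downtime an uptime SLA permits over a period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    # The downtime budget is the share of the period not covered by the SLA.
    downtime_fraction = (100 - uptime_percent) / 100
    return period * downtime_fraction


# Illustrative SLA levels (assumed values), measured over a 365-day year.
year = timedelta(days=365)
for sla in (99.0, 99.9, 99.99):
    budget = allowed_downtime(sla, year)
    print(f"{sla}% uptime -> up to {budget} of downtime per year")
```

The same function works for a quarter or a month: pass a different `period`, since the SLA fixes the percentage, not the measurement window.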
An important question is what will count as downtime and what will not, as well as how downtime will be measured. For example, you have a solution that consists of a core with 5 services “spinning” around it. Does it count as downtime if one of the services fails? And what if 2 fail (or 3…)?
The key to the answer is the dependencies between the services and how the downtime of one of them affects the solution as a whole. In particular, you need to monitor the availability of each service and to define, for each of them separately, what counts as downtime.
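One way to reason about how a single failure spreads is to walk the dependency graph. The sketch below assumes a hypothetical map of a core and 5 services; the service names and the `affected_by` helper are invented for illustration.

```python
from collections import deque

# Hypothetical dependency map: service -> services it depends on.
DEPENDS_ON = {
    "core": [],
    "auth": ["core"],
    "billing": ["core", "auth"],
    "reports": ["billing"],
    "search": ["core"],
    "notifications": ["auth"],
}


def affected_by(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map: service -> services that depend on it.
    dependents: dict[str, list[str]] = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk over the inverted graph, starting at the failed service.
    affected: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


print(sorted(affected_by("auth", DEPENDS_ON)))
```

In this assumed layout, a failure of `reports` affects nothing else, while a failure of `core` reaches every other service, which is why the two cases deserve different downtime definitions.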
For example, one way to define downtime for the solution is to define a set of major business flows. Which flows count as major depends on the context. They can be the most commonly used flows, the flows that are critical from a functional point of view, the flows that are critical for "white-glove" customers, and so on. When the major flows fail, it is downtime. A situation in which one or more services are unavailable without affecting the solution as a whole can be treated as a Severity 1 incident, which nevertheless requires immediate action. Once again, it all depends on context.
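The flow-based rule can be sketched as a small classifier. Everything here is an assumption for illustration: the flow names, the services each flow needs, and the choice to treat the failure of any single major flow as downtime (you may prefer a stricter or looser rule).

```python
# Hypothetical data: the services each business flow needs in order to work.
FLOW_REQUIRES = {
    "checkout": {"core", "auth", "billing"},
    "login": {"core", "auth"},
    "monthly_report": {"core", "reports"},
}

# Hypothetical choice of which flows count as major.
MAJOR_FLOWS = {"checkout", "login"}


def classify(failed_services: set[str]) -> str:
    """Classify the state of the solution given the set of failed services."""
    # A flow is broken if at least one service it needs has failed.
    broken_flows = {
        flow for flow, needs in FLOW_REQUIRES.items() if needs & failed_services
    }
    if broken_flows & MAJOR_FLOWS:
        return "downtime"  # assumption: any broken major flow counts
    if failed_services:
        return "severity 1 incident"  # something failed, major flows still work
    return "ok"


for failed in (set(), {"reports"}, {"billing"}):
    print(sorted(failed), "->", classify(failed))
```

Keeping the flow definitions as data rather than code makes it easy to revisit which flows are major as the context changes.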
Accordingly, Service Level Agreements must be defined for each of the services (their values may vary from service to service). There should also be SLAs between services. Think of this as teamwork. If your team members do not honor the agreements among themselves, how can you fulfill the contract with your customer? SLAs between services must be consistent with the customer-facing SLAs and with the SLA for the solution as a whole.
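A rough consistency check between per-service SLAs and the customer-facing SLA might look like the sketch below. It assumes the simplest model: every listed service is required for the solution to work, and failures are independent, so availabilities multiply. Real systems with redundancy or correlated failures need a different model. The function names and numbers are illustrative.

```python
import math


def composite_availability(availabilities_percent: list[float]) -> float:
    """Availability (in percent) of a chain in which every service is required.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return 100 * math.prod(a / 100 for a in availabilities_percent)


def meets_customer_sla(
    internal_slas_percent: list[float], customer_sla_percent: float
) -> bool:
    """Check whether the internal SLAs can add up to the customer-facing SLA."""
    return composite_availability(internal_slas_percent) >= customer_sla_percent


# Assumed example: a core at 99.95% and three required services at 99.9% each.
internal = [99.95, 99.9, 99.9, 99.9]
print(f"composite availability: {composite_availability(internal):.4f}%")
print("meets a 99.9% customer SLA:", meets_customer_sla(internal, 99.9))
```

Under these assumptions the composite figure is always lower than the weakest individual SLA, so promising the customer the same number that each service promises internally does not hold.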