Use your Scheduled Downtime (pt 1)

Leonid Yashchuk

Senior IT Project Manager – EPAM Systems

Published: August 12, 2018

Part 1 (part 2 is here)

Service Level Agreement

There are many things you can never avoid in a real-world production environment. One of them is downtime (or an outage).

Even if you run a high-availability architecture, it is only highly available, not continuously available. An outage is something you should expect and be ready for.

When you provide services, you have a set of Service Level Agreements (SLAs). In general, these are contracts between you and the users of the system or service. They set out the obligations of the parties: what the user is paying for in terms of service quality, and what the service provider is obliged to do to ensure the required level of service.

Besides, SLAs exist not only between you and your users; they also exist (I hope) between your services. A few words about this later.

In business terms, one of the most critical SLAs is the one for uptime (or availability). This is the percentage of time in a year (quarter, month, etc.) during which the service is guaranteed to be available to the user.
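To make the percentage concrete, here is a minimal sketch that converts an uptime SLA into the downtime it permits over a period. The function name `allowed_downtime_minutes` and its parameters are illustrative assumptions, not something defined in this article.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: float) -> float:
    """Return the maximum downtime, in minutes, that still meets the SLA.

    sla_percent -- guaranteed uptime as a percentage (for example 99.9)
    period_days -- length of the measurement period in days (assumed input)
    """
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    if period_days < 0:
        raise ValueError("period_days must not be negative")
    total_minutes = period_days * 24 * 60
    # Downtime budget is the share of the period not covered by the uptime guarantee.
    return total_minutes * (100 - sla_percent) / 100


if __name__ == "__main__":
    for sla in (99.0, 99.9, 99.99):
        budget = allowed_downtime_minutes(sla, period_days=30)
        print(f"{sla}% over 30 days allows about {budget:.1f} minutes of downtime")
```

For a 30-day period, a 99.9% uptime SLA leaves a budget of roughly 43.2 minutes. The same SLA expressed per year or per quarter gives a different absolute budget, which is why the measurement period belongs in the agreement.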

In some cases uptime and availability are not synonymous. For example, a service can be up but, for a number of reasons, not available to the user or to its target audience. It is most reasonable to define the availability or uptime of a service (or of a system as a whole) in terms of its ability to perform its functional tasks. The user and the operator must therefore share an unambiguous understanding of what these terms mean. For simplicity, the rest of this article uses the two terms as synonyms.

Downtime is the percentage of time during which a service may be unavailable. In the vast majority of cases this is a ceiling, not a figure that must be reached. It is usually read as: "The service will be inaccessible for at most this long, but we will do everything possible to keep downtime lower."

An important question is what will count as downtime and what will not, as well as how downtime will be measured. For example, you have a solution that consists of a core around which 5 services are “spinning”. Does it count as downtime if one of the services fails? And what if 2 fail (or 3…)?

The key to the answer is the dependencies between services and how the downtime of one of them affects the solution as a whole. In particular, you need to monitor the availability of each service and define what counts as downtime for each of them separately.

For example, one way to define downtime for a solution is to identify its set of major business flows. Which flows count as major depends on the context: they can be the most commonly used flows, flows that are critical from a functional point of view, flows that are critical for "white gloves customers", and so on. When major flows fail, it is downtime. A situation in which one or more services are unavailable without affecting the solution as a whole can be treated as a Severity 1 incident, which nevertheless requires immediate action. Once again, it all depends on context.
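The rule above can be sketched as a small classifier. Everything in it is an assumption made for illustration: the flow and service names are hypothetical, and the sketch treats the failure of any single major flow as downtime, which is only one possible reading; your own context may call for a different threshold.

```python
from enum import Enum


class Status(Enum):
    OK = "ok"
    SEVERITY_1 = "severity 1 incident"
    DOWNTIME = "downtime"


def classify(major_flows: dict[str, set[str]], failed_services: set[str]) -> Status:
    """Classify the state of the solution from the set of failed services.

    major_flows maps each major business flow to the services it depends on.
    Assumption: a flow fails as soon as any service it depends on has failed.
    """
    for required_services in major_flows.values():
        if required_services & failed_services:
            # At least one major flow is broken: the solution is down.
            return Status.DOWNTIME
    if failed_services:
        # Something failed, but no major flow is affected.
        return Status.SEVERITY_1
    return Status.OK


if __name__ == "__main__":
    # Hypothetical flows and services, for illustration only.
    flows = {
        "checkout": {"core", "payments"},
        "search": {"core", "catalog"},
    }
    print(classify(flows, {"recommendations"}))  # Status.SEVERITY_1
    print(classify(flows, {"payments"}))         # Status.DOWNTIME
    print(classify(flows, set()))                # Status.OK
```

Keeping the flow-to-service mapping as explicit data means the definition of downtime can be reviewed and agreed on with the customer rather than being buried in monitoring logic.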

Accordingly, Service Level Agreements must be defined for each of the services (their values may vary from service to service). There should also be SLAs between services. Think of this as teamwork: if your team members don't fulfill the agreements among themselves, how can you meet the contract with your customer? SLAs between services must correlate with the customer-facing SLAs and with the SLA for the solution as a whole.
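One rough way to check that correlation is to multiply the availabilities of the services a customer-facing flow depends on. This sketch assumes the services are chained in series (the flow needs all of them) and that they fail independently; real systems with redundancy or shared failure causes will behave differently. The function names are illustrative.

```python
import math


def composite_availability_percent(service_slas: list[float]) -> float:
    """Availability (%) of a flow that needs every listed service to be up.

    Assumes serial dependency and independent failures, so the
    availabilities (as fractions) multiply.
    """
    return 100 * math.prod(sla / 100 for sla in service_slas)


def supports_customer_sla(service_slas: list[float], customer_sla: float) -> bool:
    """True if the composite of the internal SLAs is at least the customer-facing SLA."""
    return composite_availability_percent(service_slas) >= customer_sla


if __name__ == "__main__":
    internal = [99.9, 99.9]
    print(f"composite: {composite_availability_percent(internal):.4f}%")
    print("supports a 99.9% customer SLA:", supports_customer_sla(internal, 99.9))
```

Under these assumptions, two services that each promise 99.9% give a combined figure of about 99.8%, so they cannot back a 99.9% customer-facing SLA on their own. The internal SLAs have to be stricter than the one promised to the customer.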
