We Need to Talk about On Call and Human Factors
Photo by cottonbro studio: https://www.pexels.com/photo/close-up-shot-of-a-person-calling-on-a-phone-6862844/

On call is one of the clearest examples of where the IT industry fails to understand human factors. I have worked for companies that tried to do things right and for companies that ended up with totally unworkable on call systems, and yet this is something everyone struggles with.

As a note, I do discuss the fact that courts have been active on this topic as well, but my point here is mostly to note that courts sometimes do recognize these problems. I am not a lawyer and nothing here is legal advice.

Why We Need On Call

Businesses that continuously serve customers always have some systems, whether email and web servers, database systems, or others, that must always be online for the business to keep operating. Having such a system down for the night, or worse still for a weekend, is usually totally unacceptable both to such businesses and to their customers. An example most people can relate to: if your internet connection went down on Friday night, you wouldn't be happy waiting until business hours on Monday for someone to come in and fix it.

The usual solution here is to have engineers who can be called in to work on problems when they arise after hours. The engineers are not expected to be working most of the time, but are expected to be available for work when needed within a reasonable response time.

The problem though is that sometimes there is a perception that on-call engineers are instantly available. This is not merely unfair but it also undermines the reliability of the systems on-call is supposed to protect.

On Call vs Waiting At Work

On call arrangements, in their ideal form, allow on-call personnel to go about their lives and then be effectively called in to work when something goes wrong. This is different from waiting at work and being immediately available to address an issue.

To be sure, there are cases where response time can be generally assumed to be rapid. For example, if an engineer is asleep when the call comes in, we can generally assume that the engineer will immediately log in and begin work.

However, there are many other things we often do in our personal lives that take time to complete and cannot be quickly abandoned. These range from family obligations, such as walking small children to school, to things all of us need to do frequently, such as grocery shopping. On call regimens should not be disruptive to any of these things. You cannot expect an engineer who is a parent to stop walking kids to school when on call, or to get all the grocery shopping done before the shift starts.

This distinction is critically important in places where labor laws limit working hours. In 2003, the European Court of Justice found that a doctor in Germany who was on standby at a hospital was effectively working because he was required to be on the hospital premises and available for work during the entirety of his on call shift. In 2018, the same court found that volunteer firefighters who were required to live within 8 minutes of the fire station and leave immediately when called were working for the entire period of their standby shifts. In 2021, the court addressed the issue more generally and found that workers who were frequently called for larger services during on call periods should be considered to be working during the entire shift, while those who work only on occasional small tasks are not.

The EU legal actions here come in the context of EU occupational health legislation, and the health and safety impacts are also reasons why we all should care. Health and safety is all of our jobs and we shouldn't need courts to get involved.

The Cost of Getting On Call Wrong

When I look back at my career, the large operational mistakes I have made have all been the result of on-call related fatigue. Circadian disruption, sleep disruption, and lack of real rest periods were usually the primary contributors. In one case I declined to follow up with an escalation call when I was told that the person I thought I needed to call was unreachable. The result was that networking issues led to significant financial losses.

Stressful, noisy on call shifts impose significant personal costs in terms of both physical and mental health, and can put stress on the families of on call engineers. These costs I probably cannot fully describe here, but the reliability issues are clear. We have a problem as an industry and we need to do better.

We are only now starting to understand the physiology of mental fatigue, with breakthroughs in understanding what is now called the glymphatic system and the metabolic hallmarks of mental fatigue. However, what is clearly known is that good quality sleep, time to recover away from strenuous mental tasks, and stable sleep patterns are necessary to avoid such fatigue.

And fatigue is the number one enemy of reliability. In my talks I often ask how many of us in the industry have had to work on critical systems while drunk, or at least been on teams where someone has. Yet very few have seen this cause significant measurable problems. By contrast, nearly everyone has seen outages caused by people working on systems while fatigued or tired. And yet as an industry, we valorize working while fatigued (and this is a big part of our on-call problem).

Do's and Don'ts of On Call Policies

A healthy on call environment benefits us all. The engineers who work on call benefit from lower stress, better health, and better productivity. Businesses benefit in terms of reliability. To this end, I have several key recommendations to improve on call for everyone:

Don't expect or demand immediate or rapid response all the time.

Do carefully think through response times in terms of ordinary human activities. If someone is expected to go home leaving grocery shopping unfinished, the response time is too short. Usually I have pushed for an agreement to be reachable by telephone within 15 minutes, and then be able to start work within another hour.

Don't have an absolute ban on consumption of alcohol. At least in IT, there is no reason an on-call engineer cannot have a glass of wine with dinner at a nice restaurant while on call.

Do treat alcohol like you would during off hours but with the idea that the engineer may need to appear for work unexpectedly. It is ok to say not to consume alcohol in a manner that impairs one's duties. Just don't have a blanket ban.

Don't expect people to show up to work immediately after working at night on incidents.

Do formulate and enforce a policy of rest times that gives people a proper chance to recover. I have usually pushed for a day off for every four hours worked on call, as there is always time required to fall back asleep. However, at a minimum, engineers should be empowered and encouraged to take rest following any sort of alert during the on-call shift.
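
For teams that want to write the numbers above into a policy or a scheduling script, here is a minimal Python sketch. It only restates the figures from the recommendations (reachable within 15 minutes, working within another hour, a day off for every four hours worked). The function names are hypothetical, the example timestamp is arbitrary, and rounding the day-off count down to completed four-hour blocks is my assumption rather than part of the recommendation.

```python
from datetime import datetime, timedelta

# Figures taken from the recommendations above.
REACHABLE_WITHIN = timedelta(minutes=15)  # time to answer the phone
START_WORK_WITHIN = timedelta(hours=1)    # further time to begin work
HOURS_PER_DAY_OFF = 4                     # one day off per four hours worked


def response_deadlines(paged_at):
    """Return (reachable_by, working_by) for a page sent at paged_at."""
    reachable_by = paged_at + REACHABLE_WITHIN
    working_by = reachable_by + START_WORK_WITHIN
    return reachable_by, working_by


def days_off_owed(hours_worked):
    """Return whole days off earned for hours worked on call.

    Assumption: only completed four-hour blocks count (rounds down).
    """
    if hours_worked < 0:
        raise ValueError("hours_worked must not be negative")
    return int(hours_worked // HOURS_PER_DAY_OFF)


if __name__ == "__main__":
    paged = datetime(2024, 3, 2, 2, 40)  # arbitrary example timestamp
    reachable_by, working_by = response_deadlines(paged)
    print(reachable_by.time(), working_by.time())  # 02:55:00 03:55:00
    print(days_off_owed(9))  # 2
```

A team that prefers to round up, so that any night work earns at least one day off, would change only the last line of `days_off_owed`.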

It is possible to have healthy on call cultures and expectations. And we must all do our part to make that happen.


Valeriy Meleshkin

Living in a database

1 year ago

Another thing from personal experience: giving a person on call the sense of their back being covered is essential.
- People with deeper expertise in narrow domains should not be very hard to reach.
- There should be secondary and tertiary fallbacks.
