The acoustic roots of “Zoom fatigue”
The Covid-19 pandemic has permanently reshaped office life. Introduced as a sort of emergency patch to keep companies and institutions running during lockdowns, videoconferencing has become part of the “new normal”. The ability to join remote or hybrid meetings via Zoom, Teams, Meet and the like reduces the need to travel, saves time and affords unprecedented access to business partners, colleagues and audiences far and wide. On paper, the potential is huge: videoconferencing allows people to fit many more meetings into their schedule without ever leaving their office. In quantitative terms, you can get “more work done in less time”. And if you are teleworking, you can do so from the comfort of your own home. Less travelling, less commuting, possibly more time for yourself and your family: videoconferencing is, ostensibly, a sure-fire way of improving your well-being and your work-life balance.
Yet many videoconferencing participants find online meetings unpleasant and tiring, and scientists warn that “Zoom fatigue” may have serious consequences for human health, including burnout (1). Numerous attempts have been made to explain the causes of videoconferencing fatigue, mostly in terms of cognitive and emotional distress: frustration over the lack of eye contact, diminished or non-existent access to visual cues and body language, multitasking, and the lack of (or unnatural exposure to) other types of visual information are just some of the proposed culprits (2). Under this visually oriented, cognitive perspective, the videoconferencing environment fails to satisfy the basic, innate requirements for effective communication between human beings, which generates subconscious frustration. Frustration leads to an adverse emotional response and negative emotions, which, in turn, generate stress.
This way of explaining the problem has three major shortcomings:
a) As it is based on the media naturalness theory (3), it tends to reduce the non-verbal aspects of communication to merely visual elements for conveying emotions and creating a rapport between people. It therefore focuses on visual aspects such as eye contact, facial expressions and body language (and the absence thereof during videoconferences) and visual feedback (participants apparently feeling alienated by the sight of their own face on the screen), compounded by additional stress factors such as multitasking or asynchronicity due to system latency.
b) It considers chronic fatigue, stress and burnout as the almost exclusive potential outcomes of frequent exposure to videoconferencing.
c) It fails to explain why the same factors, or similar combinations thereof, do not lead to similar outcomes in other settings. For instance, listening to a radio programme also involves the total absence of body language, visual cues and facial expressions, and listeners typically engage in other tasks while listening (ironing, dishwashing, driving, cleaning, working out…), yet exposure to a radio programme while performing other tasks does not appear to generate stress. Likewise, exchanging emails or writing letters involves huge, noticeable latencies (hours, days, weeks), no body language and possibly a great deal of frustration if no answer is received, but the concept of “correspondence fatigue” has never become mainstream and is unknown in the scientific literature. Watching YouTube tutorials or livestreams also involves micro-latency issues (most YouTubers are very slightly out of sync), an inability to interact (a crowded chat is all you have during a YouTube or Facebook livestream) and the absence of eye contact whenever the speaker is not looking directly into the camera, but “YouTube fatigue” has not become an issue either, notwithstanding the platform’s popularity and widespread use.
Inaccurate as these attempts at an explanation may be, videoconferencing stress is real, and its impact on the human nervous system has now been demonstrated and described in a study recently published in Scientific Reports. A team of Austrian researchers has convincingly shown that merely passive exposure (no bi-directional communication, no interaction) to less than an hour of videoconferencing (e.g. a university lecture) causes measurable stress and fight-or-flight reactions not seen in subjects exposed to the same lesson face-to-face (4). Significantly, these results were produced in a quiet, stress-free and controlled laboratory environment, where exposure to the videoconference was the only potential stressor.
This means there is something about videoconferencing that inherently causes brain fatigue and measurable autonomic-nervous-system activity of the kind typically seen when human beings perceive a threat. It happens even with passive exposure, in situations requiring neither interaction nor multitasking - which rules out frustrated expectations of interaction, multitasking, missing eye contact and the like as the source of the cognitive and emotional stress.
But what threat is the autonomic nervous system reacting to? And is this threat real or imaginary?
Some light can be shed on this question by focusing on a usually overlooked variable: one that is typical of videoconferencing environments and virtually absent both from more professional forms of broadcasting and from face-to-face interaction, neither of which seems to cause recipients any stress. That variable is the presence of degraded and heavily manipulated audio signals.
Though most people (and research teams) have so far failed to notice this, videoconference audio is highly unnatural and often heavily processed. Some of the key factors behind degraded videoconferencing sound have already been shown to:
a) cause measurable stress reactions in both human beings and test animals, even when taken individually; and b) be neither visual, cognitive nor relational in nature.
Let’s take a look at these stressors:
“Poor sound”
In a study published in 2002, Wilson and Sasse (5) showed how different types of audio degradation elicited different reactions in test subjects: some purely psychological, others neurophysiological. On the one hand, the sound defects typically caused by poor connectivity or network issues (packet loss), which result in a loss of speech intelligibility (missing syllables, words, phrases or sentences), were reported as annoying and unpleasant by test subjects, but failed to trigger any significant physiological stress reaction. On the other hand, the audio degradation caused by low-quality microphones and “loud sound” was hardly noticed by test subjects and therefore prompted no “cognitive” complaints, but it did cause measurable fight-or-flight reactions. The participants’ nervous systems were clearly perceiving a threat that their conscious minds failed to notice and identify, even though the laboratory task was minimally engaging and involved no real cognitive or relational effort.
A significant impact on the human nervous system could therefore be measured as a consequence of stimuli that did not reach the consciousness of test subjects and did not cause any cognitive or psychological reaction.
Fried guinea pig brains
A recent experiment conducted by Professor Paul Avan’s team at the Institut Pasteur (6) has shown how heavily processed digital sound (in this case, heavy dynamic range compression) can literally destroy parts of the auditory circuitry in the brain stem of test animals, at listening levels that caused no irreversible brain-stem damage in animals exposed to far less processed signals. The middle-ear reflex (the mechanism that protects the ear from loud sound) of the animals exposed to heavy dynamic range compression could no longer function properly after the experiment. Dynamic range compression is a form of digital audio processing that can be used to “inflate” sound and engineer a sense of perceptual loudness without raising peak levels.
Tinnitus at the dentist’s
Dentists are particularly exposed to tinnitus and hearing loss, owing to the noise made by dental drills and similar hand-held devices. Studies on these drills have mostly failed to detect emissions exceeding safe limits, and dental practices are not usually experienced as loud places by patients. So how are hand-held devices damaging dentists’ ears? By looking into the qualitative aspects of dental-drill noise, one research team (7) found that the noise made by these devices is concentrated in a very specific range of the audible spectrum (2-6 kHz). Incidentally, the human auditory system is particularly sensitive to high-frequency vibration, especially between 3 and 5 kHz. Opera singers leverage resonance in that region (especially around 3 kHz) to cut through the background sound of an orchestra, and infant voices deliver more energy than adult voices in the same region (especially around 4 kHz) in order to attract attention. Shouting also boosts energy in that part of the spectrum. Delivering higher concentrations of acoustic energy around 3-4 kHz is a sure-fire way of attracting attention by causing alarm - and, to the human nervous system, alarm means stress.
The human inner ear has evolved to be particularly sensitive to those frequency bands: the specialized sensors serving them are both very abundant and easily damaged by acoustic trauma, which is usually caused by occasional exposure to extremely loud spikes (jet engines, gunshots) or by frequent exposure over time to large average doses of noise in loud environments (e.g. factories, construction sites).
It would appear that concentrating acoustic energy in sensitive frequency bands may harm the human ear even at levels much lower than those considered unsafe to date. Interestingly, a recent Swedish study found that pre-school teachers exposed to theoretically safe daily doses (75-85 dB) of infant voices have a higher incidence of hyperacusis (8).
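The “concentration” at issue here is easy to quantify. Below is a minimal Python sketch - with invented toy signals, not recordings of actual drills or calls - that measures what share of a signal’s energy falls into the 2-6 kHz region highlighted by the dental-drill study:

```python
import numpy as np

def band_energy_share(x, fs, lo=2000.0, hi=6000.0):
    """Fraction of total spectral energy falling between lo and hi (Hz)."""
    power = np.abs(np.fft.rfft(x)) ** 2              # power spectrum
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)      # bin frequencies in Hz
    band = (freqs >= lo) & (freqs <= hi)
    return power[band].sum() / power.sum()

fs = 48_000                                          # 1 s of audio at 48 kHz
t = np.arange(fs) / fs
rng = np.random.default_rng(0)
broadband = rng.standard_normal(fs)                  # white noise: energy spread evenly
drill_like = np.sin(2 * np.pi * 3500 * t) + 0.5 * np.sin(2 * np.pi * 4800 * t)

print(f"broadband noise: {band_energy_share(broadband, fs):.0%} of energy in 2-6 kHz")
print(f"drill-like tone: {band_energy_share(drill_like, fs):.0%} of energy in 2-6 kHz")
```

A broadband signal leaves only a modest share of its energy in that band; the drill-like signal parks virtually all of it there.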
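Likewise, the “daily dose” logic behind occupational noise limits can be made concrete. This is a minimal sketch following the NIOSH-style rule (85 dBA as the full 8-hour allowance, with every +3 dB halving the permitted time); the example exposures are invented:

```python
def allowed_hours(level_dba, criterion_db=85.0, criterion_hours=8.0, exchange_db=3.0):
    """Permissible daily exposure at a given level: every +3 dB halves the time."""
    return criterion_hours / 2 ** ((level_dba - criterion_db) / exchange_db)

def daily_dose_pct(exposures):
    """exposures: list of (level_dBA, hours); 100% = the full daily allowance."""
    return 100.0 * sum(hours / allowed_hours(level) for level, hours in exposures)

# Hypothetical working day: quiet office work plus headphone-based video calls.
day = [(70.0, 5.0),   # background office noise
       (82.0, 3.0)]   # videoconference calls on headphones
print(f"daily noise dose: {daily_dose_pct(day):.0f}% of the allowance")
```

Note what the dose model cannot see: it counts only level and duration, not where in the spectrum the energy is concentrated - which is precisely the blind spot this article is about.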
But what does videoconferencing have in common with crying children, opera singers, dental drills, overcompressed music or high average dB exposure? Is videoconferencing sound “loud” or “operatic”? Do videoconference participants shout or use dental drills during calls? And, if a psycho-cognitive model of the problem is applied, does videoconference sound lack any of the crucial elements of human communication?
Videoconferencing sound is aggressive
Not many people notice this, but the voices of videoconferencing participants, and the noise they occasionally make near their devices, are extremely artificial. Videoconferencing audio is anything but natural-sounding: it is usually piercing, metallic, slightly (or very) robotic and often muffled, even when connection speeds are very high and the video is HD. In other words, videoconference sound is intrinsically “noisy”. Multiple tests of videoconferencing platforms have shown how the widespread use of AI algorithms that “optimize” voice, remove background noise, keep your voice at a constant level regardless of how far you stray from the microphone, and prevent audio feedback introduces sizeable amounts of harmonic distortion, muffles the sound and reduces intelligibility. Noise-filtering algorithms have the biggest detrimental impact, since removing background noise live is an AI-driven guessing game, and doing it without compromising the quality of the signal (i.e. without degrading timbre) is virtually impossible. The more aggressive the filtering, the heavier the muffling, distorting and robotizing effect on the speaker’s voice. Noise filters also have a widely recognized impact on speech intelligibility, which is why additional AI algorithms are used to compensate: they engineer a sense of perceptual loudness into the signal so that it stands out against the background noise and makes the most of the receiver’s usually undersized and underperforming computer speakers or headphones.
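Harmonic distortion of the kind these tests report is straightforward to measure. Here is a minimal sketch that uses hard clipping as a stand-in for whatever nonlinear “enhancement” a given pipeline applies (the drive level is arbitrary; no specific platform is being modelled):

```python
import numpy as np

fs, f0 = 48_000, 1000                                # 1 s of a 1 kHz test tone
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * f0 * t)
clipped = np.clip(1.5 * tone, -1.0, 1.0)             # drive the tone into hard clipping

def thd(x, f0, n_harmonics=10):
    """Total harmonic distortion: harmonic amplitude relative to the fundamental.
    With exactly 1 s of signal, FFT bin k sits at k Hz."""
    spectrum = np.abs(np.fft.rfft(x))
    harmonics = np.sqrt(sum(spectrum[k * f0] ** 2 for k in range(2, n_harmonics + 1)))
    return harmonics / spectrum[f0]

print(f"clean tone THD  : {thd(tone, f0):.2%}")
print(f"clipped tone THD: {thd(clipped, f0):.2%}")
```

The clean tone measures essentially 0% THD; the clipped one sprays double-digit percentages of its energy into harmonics that were never in the voice to begin with.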
Put in oversimplified, layman’s terms, this is typically done in 3 ways:
a) by equalizing the signal aggressively and raising levels in a targeted manner: the signal is manipulated so that proportionally more dB (often as much as 20-25 dB extra) are assigned to the ranges of the audible spectrum where the human ear is most sensitive: the notorious 3-5 kHz area. This makes the sound piercing and metallic without exceeding theoretically safe dB levels, and punches the auditory system squarely where it is least defended. It is the AI-generated equivalent of dental-drill noise (see the first sketch after this list).
b) by raising the softer components of the signal at a microscopic level so that they become louder than they would naturally be. The overall permissible peak dB levels are not exceeded, but the average dB content of the signal increases, while energy remains concentrated in the very sensitive ranges of the inner ear described above. The “dental drill” never drops below a certain level, determined by an AI algorithm programmed (or operating autonomously) with little understanding of how the human hearing system is supposed to work. This type of processing is a form of aggressive dynamic range compression (not to be confused with data compression, as in mp3), also known as “upward compression”, and it is likewise illustrated in the first sketch after this list. Measurements show that exposure to dynamic ranges of no more than 10 dB is frequent in videoconferencing and hybrid settings, which means the sound is flat and the auditory system (especially some of its softest spots) is kept under constant, unrelenting pressure. The laws of physics are very clear about constant pressure concentrated on a small surface area: pressure does not need to be heavy to dig a hole if it is applied constantly and for long enough. Bed sores in bed-ridden hospital patients are a vivid illustration of how this type of signal can damage the auditory system. Published research, moreover, shows that the human auditory system activates protective reflexes (the stapedial reflex, also known as the middle-ear reflex) even at levels considered quiet, whenever the 3-5 kHz bands are louder than they would be in “natural” audio signals. The system is spotting a threat and trying to fend it off (9).
Experiments referenced above (guinea pigs) have also shown that overuse of dynamic range compression can have catastrophic impacts on the auditory components of the mammalian brain stem.
c) Whenever faced with a sound perceived as stressful or threatening (including, but not limited to, sound that is “too loud”), the human auditory system activates the so-called stapedial reflex: the tiny muscles in the middle ear contract to dampen that sound, thus limiting the inner ear’s exposure to a dangerous stimulus. However, this reflex needs some tens of milliseconds to kick in (latency).
AI algorithms (and careless sound engineers) seeking to increase the “crispness” of consonants in order to “boost speech intelligibility” can make dynamic range compression so aggressive that every consonant becomes a micro-threat to the auditory system. Consonants are rich in high-frequency content, and “s” sounds can even reach up to 5 kHz. Consonants usually begin with a very short spike in level followed by a sudden drop (a phenomenon known as an “attack transient”). Of course, no natural consonant rises to its maximum intensity so fast that the stapedial reflex cannot kick in when necessary. Yet, when a digital compressor is programmed in such a way that these spikes become faster than the stapedial reflex - which is perfectly possible with modern technology - the sudden onsets of hundreds of overcompressed consonants per minute, even at supposedly “safe” listening levels, turn into a flurry of micro-shocks delivered to the softest spots of the inner ear (see the second sketch after this list).
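To make (a) and (b) tangible, here is a minimal sketch with invented parameters - a +20 dB peaking boost at 4 kHz (the standard RBJ audio-EQ-cookbook filter) followed by a crude static upward compressor; no real platform’s settings are known or implied. After matching the processed signal back to the original peak, the average level rises and the crest factor (peak minus average, a rough proxy for dynamic range) shrinks:

```python
import numpy as np
from scipy.signal import lfilter

def peaking_eq(x, fs, f0=4000.0, gain_db=20.0, q=1.0):
    """RBJ audio-EQ-cookbook peaking filter: boost gain_db around f0."""
    A = 10 ** (gain_db / 40)
    w0 = 2 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2 * q)
    b = np.array([1 + alpha * A, -2 * np.cos(w0), 1 - alpha * A])
    a = np.array([1 + alpha / A, -2 * np.cos(w0), 1 - alpha / A])
    return lfilter(b / a[0], a / a[0], x)

def upward_compress(x, fs, threshold_db=-40.0, ratio=4.0):
    """Static upward compression: lift everything below the threshold."""
    win = fs // 100                                   # 10 ms envelope follower
    env = np.sqrt(np.convolve(x ** 2, np.ones(win) / win, mode="same")) + 1e-12
    env_db = 20 * np.log10(env)
    gain_db = np.where(env_db < threshold_db,
                       (threshold_db - env_db) * (1 - 1 / ratio), 0.0)
    return x * 10 ** (gain_db / 20)

def report(label, x):
    peak = 20 * np.log10(np.max(np.abs(x)))
    rms = 20 * np.log10(np.sqrt(np.mean(x ** 2)))
    print(f"{label}: peak {peak:6.1f} dBFS, rms {rms:6.1f} dBFS, crest {peak - rms:4.1f} dB")

fs = 48_000                                           # 2 s of fluctuating "speech-like" noise
t = np.arange(2 * fs) / fs
rng = np.random.default_rng(1)
voice_like = 0.05 * (1 + np.sin(2 * np.pi * 3 * t)) * rng.standard_normal(2 * fs)

processed = upward_compress(peaking_eq(voice_like, fs), fs)
processed *= np.max(np.abs(voice_like)) / np.max(np.abs(processed))  # match original peak

report("original ", voice_like)
report("processed", processed)
```

With these toy parameters the peaks match by construction, yet the processed signal carries several dB more average energy - and most of the added energy sits in the 3-5 kHz band.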
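The timing argument in (c) can also be sketched with toy numbers: an invented noise burst with a roughly 10 ms natural onset, and a crude zero-attack compressor that divides out most of the envelope. Real codecs and individual reflex latencies vary; this only illustrates the mechanism:

```python
import numpy as np

fs = 48_000
t = np.arange(int(0.1 * fs)) / fs                           # a 100 ms "consonant"
rng = np.random.default_rng(2)
burst = (1 - np.exp(-t / 0.010)) * rng.standard_normal(len(t))   # ~10 ms natural onset

def envelope(x, fs, win_ms=2.0):
    """RMS envelope with a short smoothing window."""
    win = max(1, int(fs * win_ms / 1000))
    return np.sqrt(np.convolve(x ** 2, np.ones(win) / win, mode="same"))

def fast_compress(x, fs, ratio=10.0):
    """Crude zero-attack compressor: divide out most of the envelope."""
    env = envelope(x, fs, win_ms=0.5) + 1e-9
    return x / env ** (1 - 1 / ratio)

def onset_ms(x, fs):
    """Milliseconds for the envelope to first reach 90% of its maximum."""
    env = envelope(x, fs)
    return 1000.0 * np.argmax(env >= 0.9 * env.max()) / fs

print(f"natural onset   : {onset_ms(burst, fs):5.1f} ms to 90% level")
print(f"compressed onset: {onset_ms(fast_compress(burst, fs), fs):5.1f} ms to 90% level")
# For comparison: the stapedial reflex needs tens of milliseconds to engage.
```

The compressed burst hits near-full level several times faster than the natural one - comfortably inside the window during which the middle ear is still undefended.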
[Figure: huge levels of compression and distortion (clipping), captured while a speaker takes the floor from a remote site.]
A combination of two or more of the above factors is, unfortunately, a frequent occurrence in videoconferencing and hybrid environments, and constitutes a clear and direct threat to participants’ auditory systems and central nervous systems. It is no wonder, then, that researchers detect autonomic reactions of the kind that normally correlates with the presence of a threat or of nociceptive stimuli.
The fear subliminally perceived by test subjects in the studies by Riedl, Kostoglou and Wriessnegger (4) and by Wilson and Sasse (5) is therefore real, and perfectly justified!
Why overprocess sound?
The point of overprocessing through AI algorithms is to let people join a meeting from the street, a noisy train, a windy beach or their car and still make themselves heard. It is also meant to “improve” the low-quality sound captured by the microphones built into laptop computers. Unfortunately, measurements and experience demonstrate that these “enhancements” come at a very high, though often unnoticed, price.
Videoconferencing sound is intrinsically loud, with a particular type of “loudness” that is obtained without exceeding supposedly “safe” levels - a trick already used to some degree in TV advertising. The average dB content of videoconferencing signals is measurably higher than that of “natural” sound, and these signals are inherently noisy and distorted.
To the human nervous system, the voices of videoconferencing participants are virtually screaming with operatic twang, as though they were on digital steroids.
The sources of overprocessing
Overprocessing in the videoconferencing environment is applied by multiple layers of live Digital Signal Processing (DSP) tools, found at several points along the signal chain: in the videoconferencing platforms themselves, in the sending and receiving devices (laptop audio stacks, headsets and microphones with onboard processing) and in conference-room sound systems.
Because compressors operate on levels expressed in dB (a logarithmic scale), each layer of processing multiplies the effect of the layers before it rather than merely adding to it, which massively escalates the negative impacts and generates overcompression.
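A toy static model of that escalation (invented threshold and ratio, no attack/release dynamics): each 3:1 stage multiplies the previous ratio, so three stacked stages already behave like 27:1.

```python
def compress_db(level_db, threshold_db=-30.0, ratio=3.0):
    """Static compression curve: above threshold, output rises 1/ratio as fast."""
    if level_db <= threshold_db:
        return level_db
    return threshold_db + (level_db - threshold_db) / ratio

quiet, loud = -30.0, -6.0                             # 24 dB of input dynamics
for stages in range(4):
    q, l = quiet, loud
    for _ in range(stages):
        q, l = compress_db(q), compress_db(l)
    print(f"{stages} stage(s): dynamic range {l - q:5.2f} dB")
```

The 24 dB of input dynamics collapse to 8 dB after one stage, under 3 dB after two, and under 1 dB after three - exactly the sub-10 dB flatness reported above.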
What are the repercussions for human health?
Research has so far concentrated mainly on the impacts of videoconferencing in terms of fatigue, stress and burnout syndromes, but once overprocessed sound is understood to be the chief factor behind this type of fatigue, the link between videoconferencing stress, the auditory “threat” and auditory health problems can no longer be ignored. The following two examples clarify this link.
Call-centre operators, who spend their working days listening to processed voices through headsets, were reporting disabling auditory symptoms such as acoustic shock, tinnitus and hyperacusis long before the pandemic, despite exposure levels nominally considered safe (10, 11). Since Covid, conference interpreters working with remote speakers have been reporting strikingly similar auditory health problems at international organisations (12).
Anecdotal evidence from other categories of workers frequently exposed to videoconferences suggests that the problem may be more widespread. ENT doctors do not typically ascribe these symptoms to poor-quality sound, because most of them have no specific training in, or understanding of, sound quality and sound engineering, and have never set foot in a call centre or an interpreting booth. Moreover, concepts such as “toxic sound” and “auditory burnout” do not yet appear in the medical literature, so doctors are typically unaware that poor sound quality is a potential risk factor. Several research teams, both in Europe and in North America, are now investigating the issue; meanwhile, evidence that videoconference signals are aggressive and overprocessed is already abundant, since it is extremely easy to obtain.
Is every participant in a videoconference call experiencing burnout or auditory health problems?
The impact of overprocessed videoconference audio is obviously not felt to the same extent by all participants. On the one hand, a certain proportion of the general population is known to be genetically more sensitive to noise and more prone to developing auditory conditions: videoconference signals might simply be driving people whose auditory systems are more fragile “by design” towards collapse. Yet how many of those auditory systems would develop otherwise rare and disabling conditions like hyperacusis (up to 30% in highly exposed populations) if they were not being aggressively overstimulated by AI algorithms?
On the other hand, noise is known to cause harm through prolonged exposure to levels that would be harmless if exposure were limited to one or two hours a day; this is why regulatory noise limits are always based on daily or weekly doses. Populations with a high incidence of disabling auditory symptoms (call-centre operators and, since Covid, interpreters and conference clerks) are concentrated in workplaces where multiple layers of overprocessing can easily be identified. Layering increases the dose (and makes its content more “toxic”), and can therefore accelerate the damage-causing process. Workers exposed to videoconferencing in environments where fewer layers of overprocessing are applied might simply need more exposure before they end up reporting to an audiology ward, and they might develop fatigue and burnout syndromes before they develop tinnitus.
Tinnitus and hyperacusis may, in turn, lead to depression and fatigue: tinnitus can keep sufferers awake at night, while the constant stress of everyday sounds having become harmful and even fear-inducing can take a heavy toll on one’s mental well-being. These symptoms can therefore be viewed as precursors to, or aggravating factors of, depression, fatigue and burnout syndromes.
Finally, it should be borne in mind that not all smokers end up developing lung cancer and not all factory workers exposed to asbestos have developed asbestosis. Risk factors do not necessarily produce negative impacts in each and every subject exposed to them.
Does videoconferencing really increase productivity?
Productivity can be measured in different ways. On a merely quantitative level, replacing face-to-face meetings with videoconferences allows employees and managers to take part in many more meetings than they could attend before Zoom calls became the norm. On a qualitative level, however, the results that online meetings produce hardly compare with in-person interaction: participants who unconsciously feel threatened, with their fight-or-flight mode activated, can hardly be expected to be creative, constructive, empathetic and collaborative. Optimum performance is never achieved (let alone maintained) by people who feel threatened.
Moreover, trust and credibility are key ingredients in any successful team effort or negotiation. An interesting piece of research (13) recently demonstrated that degrading the audio quality of presentations has a major impact on the perceived credibility of the speakers and of their presentations’ contents. In that study, scientists were asked to grade both the personal credibility of fellow scientists presenting their research and the quality of the research being presented. The presentations were recorded and then shown to two different groups of evaluators. Both groups were treated to exactly the same content, accent, speed of delivery, presenting skills, body language and so on; the only difference was the quality of the audio signal (i.e. the timbre of the presenter), which had been deliberately degraded. The group treated to presentations with degraded timbre found the presenters less credible and trustworthy, and their content of lesser quality.
Timbre is the signature of a person’s voice. It reveals the speaker’s inner posture, the amount of tension in the muscles, ligaments and mucous membranes, and their relative configuration. It is the audible body language of the speaker’s nervous system and inner organs. Timbre tells listeners what type of “object” is making a sound, what shape it has and what it consists of; it is what makes it possible to distinguish a violin from a trumpet playing exactly the same notes. Videoconferencing tools do not distort the speaker’s intonation, rhythm or accent; distortion and spectral manipulation alter the timbre of voices (14).
The reason why degraded timbre has such a huge impact on the credibility of speakers is that human beings tend to be more trusting of people they feel are near to them and similar to them. Timbre contains the spectral cues that allow human beings to establish the distance between themselves and a source of sound. The voices of videoconference participants are typically stripped of part of the frequencies that signal proximity, so they may sound distant even to listeners wearing headphones. Alternatively, they may sound so near as to become intrusive, or come across as impossible to place in space (and therefore unreal) because some spectral components are too loud and others too soft. At the same time, most videoconference signals reproduce voices as though speakers had no forehead and no nose and consisted, in auditory and perceptual terms, of nothing but a neck and an oversized mouth. In real life, no human being sounds like a videoconference participant; based on what meeting participants sound like online, the deep layers of our nervous system can hardly classify them as real human beings. Moreover, given the way some components of their voices are usually over-boosted, videoconference participants are inherently shouting. To what extent can you subliminally trust someone who looks like a human being and speaks like a human being, but does not sound like a human being - and sounds aggressive to your nervous system, to boot?
Given that trust, credibility and rapport are the secret ingredients of every successful human interaction, if any meaningful results are to come out of a videoconference meeting, participants must be able to come across to their counterparts (colleagues, superiors, employees) as credible, authentic and real, without triggering unwanted, unconscious fight-or-flight reactions in them.
For an HR manager, making videoconference sound natural could therefore prove to be much more useful than offering staff stress-prevention, conflict-resolution or empathy-based communication training sessions. In the age of videoconferencing, coming across as a real human being is the softest skill your employees can learn.
But can videoconference audio sound natural? And, if so, how? The following article offers a number of effective solutions:
Bibliography
1. Cranford, S. Zoom fatigue, hyperfocus, and entropy of thought. Matter 3, 587–589 (2020).
Wiederhold, B. K. Connecting through technology during the coronavirus disease 2019 pandemic: Avoiding “Zoom Fatigue”. Cyberpsychol. Behav. Soc. Netw. 23, 437–438 (2020).
Riedl, R. On the stress potential of videoconferencing: definition and root causes of Zoom fatigue. Electronic Mark. 32, 153–177 (2022).
2. Riedl, R. On the stress potential of videoconferencing: Definition and root causes of Zoom fatigue. Electronic Mark. 32, 153–177 (2022).
Denstadli, J. M., Julsrud, T. E. & Hjorthol, R. J. Videoconferencing as a mode of communication: A comparative study of the use of videoconferencing and face-to-face meetings. J. Bus. Tech. Commun. 26, 65–91 (2012).
3. Kock, N. Media richness or media naturalness? The evolution of our biological communication apparatus and its influence on our behavior toward e-communication tools. IEEE Trans. Prof. Commun. 48, 117–130 (2005).
4. Riedl, R., Kostoglou, K., Wriessnegger, S.C. et al. Videoconference fatigue from a neurophysiological perspective: experimental evidence based on electroencephalography (EEG) and electrocardiography (ECG). Sci Rep 13, 18371 (2023).
5. Wilson, G. & Sasse, A. Investigating the Impact of Audio Degradations on Users (2002).
6. Dos Santos, T., Bordiga, P., Hugonnet, C., Avan, P., “Musique surcompressée, un risque auditif spécifique”, Acoustique et Techniques, 99, 24-31, 2022.
7. Rotter K.R.G., Atherton M.A., Kaymak E. & Millar B. Noise Reduction of Dental Drill Noise. Mechatronics 2008.
8. Fredriksson S., Hussain-Alkhateeb L., Torén K., Sjöström M., Selander J., Gustavsson P., Kähäri K., Magnusson L., Persson Waye K. The Impact of Occupational Noise Exposure on Hyperacusis: A Longitudinal Population Study of Female Workers in Sweden. Ear Hear. 2022 Jul-Aug.
9. Wiley T.L., Oviatt D.L., Block M.G.: Acoustic-immittance measures in normal ears. J Speech Hear Res. 1987 Jun;30(2):161-70.
10. Toker M.A.S., Güler N. General mental state and quality of working life of call center employees. Arch Environ Occup Health. 2022.
11. By way of example:
Westcott M. Acoustic shock injury (ASI). Acta Otolaryngol Suppl. 2006 Dec
Pawlaczyk-Luszczynska M., Dudarewicz A., Zamojska-Daniszewska M., Zaborowski K., Rutkowska-Kaczmarek P. Noise exposure and hearing status among call center operators. Noise Health. 2018 Sep-Oct.
12. Garone, A.: Reported Impacts of RSI on Auditory Health at International Organisations https://drive.google.com/file/d/1OfnkWECRYJgCxNb8jxdVc_tpajMPm5id/view?usp=sharing
13. Newman, E. J. & Schwarz, N. Good Sound, Good Research: How Audio Quality Influences Perceptions of the Research and Researcher. Science Communication 40(2), 246–257 (2018).
14. More information can be found here: https://www.dhirubhai.net/pulse/does-your-voice-sound-credible-here-why-viewers-switch-andrea-caniato/