Is Zoom “Hi-Fi Mode” an Answer to Interpreters’ Woes? Not at All!
Cyril Flerov
Russian Conference Interpreter at InterStar Translations, Member of AIIC and Executive Secretary of TAALS
Executive summary: using hi-fi mode on Zoom with multiple participants in a conference is not what the mode was designed for, and it requires professional audio equipment. In standard mode, the problem is strain caused by poor speech intelligibility. In hi-fi mode, if it is not used correctly, the problem becomes added unprocessed noise (room acoustics, background noise, unmitigated sound peaks, etc.) combined with better speech intelligibility. That is why, without logistical measures to improve “regular” users’ hardware and to manage microphones strictly, the hi-fi mode will be of no additional value. Or it will simply glorify built-in laptop microphones, which will indeed sound much better and thereby wrongly justify their blanket use in the first place. Proper advance testing of each participant’s audio in hi-fi mode is required.
Text:
This is just a quick personal opinion about how the Zoom hi-fi mode may change the remote simultaneous interpretation [RSI] game.
One of the major issues interpreters face online during remote simultaneous interpretation is speech intelligibility, which is significantly reduced by audio-processing algorithms that try to save bandwidth and, as a result, distort the sound, producing low STI values [speech transmission index; see more information here: https://bedrock-usa.com/speech-transmission-index/ ].
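For readers curious about what goes into that index, here is a simplified sketch of the core of the STI calculation (my paraphrase of the IEC 60268-16 method; the full standard adds octave-band weights, redundancy, and masking corrections). The measured modulation transfer function m(f_m) in each octave band is converted into an apparent signal-to-noise ratio, and from there into a transmission index between 0 and 1:

```latex
% Simplified core of the STI method (per IEC 60268-16); the full
% standard adds octave-band weights and several correction terms.
\[
\mathrm{SNR}_{\mathrm{app}}(f_m) \;=\; 10\,\log_{10}\!\frac{m(f_m)}{1 - m(f_m)}
\qquad \text{(clipped to } \pm 15\ \mathrm{dB}\text{)}
\]
\[
\mathrm{TI}(f_m) \;=\; \frac{\mathrm{SNR}_{\mathrm{app}}(f_m) + 15}{30},
\qquad
\mathrm{STI} \;=\; \sum_{k} \alpha_k\, \overline{\mathrm{TI}}_k \;\in\; [0,\,1]
\]
```

The practical takeaway: any processing that flattens the natural loudness modulations of speech, such as aggressive codecs or noise suppression, pushes m(f_m), and with it the STI, down.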
Reduced speech intelligibility does not result in hearing loss. As science understands the situation today, only high-intensity sound peaks and sound levels have the potential to reduce, or actually do reduce, interpreters’ hearing, according to professional audiologists. The threshold is also very person-specific: in some populations hearing damage may begin at quite low levels, even in the high 70s dB.
Everything else, such as headache, dizziness, etc., belongs to the “nonauditory effects” of online sound and, while it may be very unpleasant, does not damage hearing per se. I am not saying that the symptoms do not need to be handled or taken care of; I am simply saying that much of the strain interpreters feel during RSI does not appear to result in hearing loss as audiologists now understand it. Any opinion to the contrary, for example that loss of frequencies can result in hearing loss, is, as far as I understand, not supported by professional audiologists.
So when we talk about “trauma” or “symptoms” in interpreters, we need to distinguish very clearly between physical damage, which can be temporary or permanent, and psychoacoustic effects, which may be unpleasant and long-lasting but are not in themselves permanently damaging. Unfortunately, the word “trauma” is now used too loosely and vaguely to designate an entire array of very different issues.
That is why the problem of online sound quality needs to be divided into two completely separate subsets:
1] hearing protection, for which I again [sorry about beating a dead horse; for me, in terms of hearing protection, that horse is indeed dead and buried and the topic is closed] strongly recommend a properly configured DBX 166xs Compressor/Limiter/Gate [ https://dbxpro.com/en/products/166xs ], placed at the end of the signal path just before the headset and configured with a very low attack time, a very high release time, a very low dB threshold (raised if necessary), and an infinity:1 ratio (“brickwall” mode). Set up this way, it does not allow sound of any kind to rise above the threshold preset by the user for the user’s own comfort.
Simply using a limiter that caps loud sounds at 94 dB, as per the ISO standard, is not enough online: any peaks below 94 dB will not be attenuated to the preset level and will continue to have a negative impact on interpreters. In other words, the interpreter will still experience variations in sound peaks, with all their ill effects.
It has to be a compressor, which [this is not a technically correct definition, but it keeps things simple] makes loud sounds “softer” and soft sounds “louder”; in other words, it brings the audio to one even level, which is then raised again in the headset amplifier to whatever level is comfortable for the interpreter.
In technical terms this is called “compressing the dynamic range” (see more on dynamic range compression, in particular its use in modern active hearing-protection circuits, here: https://en.wikipedia.org/wiki/Dynamic_range_compression ).
If properly applied and configured, audio compression protects the interpreter’s hearing by not exposing them to varying sound peaks, and it relieves the nonauditory symptoms of online sound, reducing or eliminating dizziness, headache, tinnitus, hyperacusis, etc.
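To make that behaviour concrete, here is a minimal sketch in Python of the kind of processing just described: very fast attack, slow release, low threshold, infinity:1 “brickwall” ratio, and make-up gain afterwards. It is a toy illustration of the general technique, not the DBX 166xs’s actual circuitry, and all parameter values are illustrative.

```python
import numpy as np

def brickwall_compress(x, sr, threshold_db=-30.0, ratio=float("inf"),
                       attack_ms=0.1, release_ms=500.0, makeup_db=20.0):
    """Toy feed-forward compressor/limiter for a float signal in [-1, 1].

    ratio=inf gives the "brickwall" behaviour described above: nothing
    is allowed to rise above the threshold. A very short attack and a
    long release mimic the settings recommended in the text.
    """
    # One-pole smoothing coefficients for the envelope follower
    att = np.exp(-1.0 / (sr * attack_ms / 1000.0))
    rel = np.exp(-1.0 / (sr * release_ms / 1000.0))
    thr = 10.0 ** (threshold_db / 20.0)   # threshold as linear amplitude
    makeup = 10.0 ** (makeup_db / 20.0)   # gain applied after compression

    env = 0.0
    out = np.empty_like(x)
    for n, sample in enumerate(x):
        level = abs(sample)
        # Fast attack while the level rises, slow release while it falls
        coeff = att if level > env else rel
        env = coeff * env + (1.0 - coeff) * level

        if env > thr:
            if np.isinf(ratio):           # brickwall: clamp to threshold
                gain = thr / env
            else:                         # ordinary ratio:1 compression
                gain = (thr / env) ** (1.0 - 1.0 / ratio)
        else:
            gain = 1.0
        out[n] = sample * gain

    # Raise the now-even signal back up, never exceeding full scale
    return np.clip(out * makeup, -1.0, 1.0)
```

The design mirrors the hardware advice above: with the ratio at infinity, no peak, however sudden, can exceed the threshold, and the make-up gain then lifts the evened-out signal to a comfortable listening level.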
In my case, I have been using this hardware since June 2020, and on workdays I spend between two and six hours a day in videoconferences of various kinds. Before that, using a regular USB headset, I had developed dizziness, headaches and, more importantly, hyperacusis, a condition in which the brain anticipates loud sounds and dreads them. After about a month of using the DBX, however, all the symptoms disappeared completely, and I can now stay in a videoconference for as long as necessary, with no time limits and no symptoms of hearing loss or discomfort.
2] improving speech intelligibility. This is definitely the second most important concern for interpreters, after hearing protection. This is where the question of correct frequency-range reproduction comes into play [loss of frequencies in an audio feed does not result in hearing loss, although some “experts” proclaim the opposite].
There has been a lot of talk online lately that the Zoom hi-fi mode may be a solution for remote simultaneous interpretation. The Zoom “hi-fi” mode was introduced by the company last year.
In a nutshell, that mode is an enhancement of the “Original Audio” mode that disables “echo cancellation & post-processing, while raising audio codec quality to 48Khz, 96Kbps mono/192kbps stereo for professional audio transmission in music education and performance applications”.
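To put those codec numbers in perspective, here is a back-of-the-envelope calculation of my own (the 16-bit PCM baseline is my assumption, not a Zoom specification): even the hi-fi mode remains lossy compression, just far less aggressive than usual.

```latex
% Uncompressed 16-bit mono PCM at 48 kHz vs. the quoted 96 kbps stream:
\[
48\,000\ \tfrac{\text{samples}}{\text{s}} \times 16\ \tfrac{\text{bits}}{\text{sample}}
\;=\; 768\ \text{kbps},
\qquad
\frac{768\ \text{kbps}}{96\ \text{kbps}} \;=\; 8{:}1\ \text{compression}
\]
```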
As you can see in the description on the company website here [ https://blog.zoom.us/high-fidelity-music-mode-professional-audio-on-zoom/ ], “High-Fidelity Music Mode delivers professional audio from a single Zoom client, streaming to one or more listeners, for performing arts and music teachers, songwriters, and anyone else looking for rich, professional-grade sound over Zoom.”
Therefore, the hi-fi mode addresses a very different use case from an online conference with multiple speakers: it is designed to deliver sound from one broadcasting point to many silent receiving points.
It disables many of the processing algorithms that do indeed make the sound choppy and reduce speech intelligibility.
However, it imposes a number of additional requirements, and we need to analyze carefully whether the mode may be suitable for RSI.
In particular, even Zoom states that “[p]rofessional audio interface, microphone, and headphones [are] required.”
Let’s analyze what that use case and those requirements actually mean:
• the use case for which the hi-fi mode was originally created [semi-professional or professional music broadcasts to listeners] is very different from what some now propose, namely that everybody use it in a meeting with multiple participants.
• The hi-fi music mode imposes extremely high requirements on the broadcaster: in particular, working from a professional studio, or from an area acoustically treated with sound-absorption panels on the walls (or at least something similar to a microphone shield), with an extremely low noise level/noise floor [see an example of shields here: https://www.amazon.com/TONOR-Microphone-Isolation-Absorbing-Reflector/dp/B078WNW4YW/ref=sr_1_6?dchild=1&keywords=microphone+shield&qid=1613236541&sr=8-6 or at least a Kaotica https://www.kaoticaeyeball.com/ in the case of desktop or suspended microphones].
With headset microphones there is really no option other than sound-treating the room, which may be extremely expensive for interpreters and completely impractical for regular participants. No average user in their right mind would use a portable booth like this either, although for professional voiceover artists it is a godsend: https://www.sweetwater.com/store/detail/ISOVOX2--isovox-home-vocal-booth
• a professional headset and a professional microphone must be used. While this is not a problem for interpreters [we are already doing it anyway], a regular participant who still uses iPhone earbuds, or no headset or microphone at all beyond the one built into the laptop, will definitely not go to the expense or trouble.
As a result, with echo cancellation disabled, the frequency range widened, and post-processing switched off, the conference may soon turn into a nightmare of unattenuated rustling, bangs, clicks, and barking dogs.
After all, sound processing and algorithms are there for a reason. Yes, in the early days of videoconferencing it was important to save bandwidth by reducing the audio and video streams to a minimum bearable quality; nowadays, with improved technology, I do not believe that is much of a concern. Other processing, however, has a positive effect even though it may degrade sound quality significantly.
Echo cancellation, for example, prevents loud feedback and helps when the speaker is not using a headset. Yes, it degrades the sound quality significantly, but it plays a positive role too. “Gating” cuts off the low-intensity hum from the microphone so that the speaker’s mic is heard only when the speaker is actually talking. That is the “silence” we hear in a conference when nobody is talking but everybody’s microphone is on.
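For illustration, here is a minimal sketch of such a gate in Python (a toy hard gate with a simple envelope follower; the threshold and time constants are illustrative values of my own, not anything Zoom publishes):

```python
import numpy as np

def noise_gate(x, sr, open_db=-45.0, attack_ms=5.0, release_ms=200.0):
    """Toy hard noise gate: mutes the signal whenever its envelope
    stays below the opening threshold, producing the "silence" between
    speakers described above."""
    att = np.exp(-1.0 / (sr * attack_ms / 1000.0))
    rel = np.exp(-1.0 / (sr * release_ms / 1000.0))
    thr = 10.0 ** (open_db / 20.0)        # gate-open threshold, linear

    env = 0.0
    out = np.empty_like(x)
    for n, sample in enumerate(x):
        level = abs(sample)
        coeff = att if level > env else rel
        env = coeff * env + (1.0 - coeff) * level
        out[n] = sample if env > thr else 0.0  # pass when open, mute when closed
    return out
```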
The Zoom hi-fi mode [which, again, was not intended for this scenario] will completely eliminate that processing.
Even with my Shure SM7B, a very decent broadcast-quality microphone, and a DBX channel strip with a Cloudlifter, it took me several days to set up the system correctly and eliminate most of the artifacts that occurred at first. As a result I achieved relatively decent audio quality in “hi-fi” mode, but even then I was unable to reduce the room echo sufficiently [I am getting a Kaotica; let’s see if that helps] or to block the street noise that the microphone still occasionally picks up despite its cardioid pattern.
Imagine the cacophony we will get in online events if the “useful” processing is disabled.
This has four implications:
1] even if the mode has any chance of working, it places an extremely heavy burden on participants, who would need to upgrade their equipment not just to “decent” but to professional standards. In real life, where some people cannot tell the difference between their loudspeaker and their microphone, that is very unlikely to happen. Some users, for example senior officials, feel they are too important to wear a headset anyway. Using the hi-fi mode will have to be planned well in advance, with extensive preparation and testing of each end user; it may be feasible only for small, prearranged groups of participants, not for spontaneous “anybody can join” events. If proper event management is not done in real time, the mode will actually increase the acoustic load on interpreters, stress them more, and create more fatigue, even though it may improve speech intelligibility.
2] it also imposes an extremely high load on conference organizers. They cannot simply start an event on Zoom and forget about it. In my experience, an event has to be very carefully planned, set up, and moderated: for example, microphones must be disabled by default when a new user enters the conference, and dedicated moderators need to be assigned to manage microphones and troubleshoot issues during the event. Many clients will simply not go to the expense or trouble. Nor will there be any way for a person with a bad microphone to be heard through heavy interference, because there will be no algorithms left to squelch that interference and make the person at least barely audible.
3] it imposes even more stringent hearing-protection requirements, and not only on interpreters but on all participants. In the scenario for which the hi-fi mode was intended, it is extremely unlikely that the music performer will suddenly drop the microphone. The “hi-fi” mode, when used with inferior microphones and even “average” background room noise, will produce additional acoustic load and loud bangs unmitigated by any algorithms. If interpreters or participants do not have appropriate hearing protection, the result will be more fatigue and the potential for actual acoustic trauma. In other words, by using the “hi-fi” mode in a scenario it was not intended for, we are potentially endangering interpreters and participants.
4] it is unclear how the hi-fi mode will interact with RSI platforms. The Achilles heel of RSI platforms is low speech intelligibility caused by their own processing and by their attempts to save bandwidth (and thus money). In the recently published AIIC test, Zoom in standard mode had the worst speech intelligibility index, but most RSI platforms were not significantly better.
Even if Zoom delivers “decent” sound quality, it will be of no real help if interpreters listen to it through an RSI platform, because in that case the RSI platform becomes the bottleneck.
Ironically, though, if everybody at an event tries to switch to the “hi-fi” mode, people will understand very quickly how deficient their equipment is. From there, there are only two ways forward: either everybody upgrades to professional, or at least “decent”, equipment and interpreters live happily ever after, or [more likely, in my opinion] people simply declare that “hi-fi mode is not working for us” and switch back to inferior sound quality, to the dismay and frustration of the interpreters. But they are users, and users do not look at the process and do not want additional complications as long as they get the result. After all, do they really think about how complicated simultaneous interpretation is? No; they just want it to happen when they need it. And since they are already getting interpretation in the “standard mode”, why switch to the so-called “hi-fi” and get all kinds of audio artifacts and loud noises?
Only time will tell whether the “hi-fi” mode will work.
While it works admirably for music broadcasts or lectures [I delivered one in this mode just a couple of days ago], I have serious doubts that clients will want to adhere to more stringent standards just to improve sound quality for interpreters.
This is a trend we see often in technology: Betamax was allegedly better than VHS, but VHS won for reasons unrelated to technology. MP3 is vastly inferior to lossless music formats, yet we use it for convenience.
My concern is that “convenience” will be the last nail in the coffin of the Zoom “hi-fi” mode in conferences with multiple speakers. While the hi-fi mode does indeed provide somewhat better sound quality, it comes at a cost, and I certainly would not consider it a panacea or a silver bullet. Only time will show how it plays out in practice!