Fear & Loathing in Voice Authentication - 10 Key Ideas on Deepfakes
This will be a series of 10 blog posts, each introducing one key idea on how we can function in the era of Generative AI-quality deepfakes. We will explore this topic particularly through the lens of how we can establish trust in verbal communication via voice authentication technologies.
This series of articles aims to share my perspective as a voice authentication, fraud detection and deepfake detection (antispoofing) AI research leader, as well as that of my research team and other colleagues who, collectively, work in industry, build the voice authentication products used by industry, and have decades of real-world experience in how these technologies operate in the field. Throughout our many years in this space, we’ve had successes, seen fundamental threats to the core technology, and earned our share of grey hairs, with lessons learned along the way.
Each article will introduce one Key Idea on voice deepfakes and how they intersect with voice authentication. The first few Key Ideas should be general enough to be appreciated by anyone; the minimum entry fee is your curiosity.
As we progress through the Key Ideas, they very much become geared towards researchers and technologists developing voice biometrics, custom text-to-speech, deepfake countermeasures, and a variety of related AI-based (including Generative AI) technologies. We will share recommended guidelines on how to effectively develop voice authentication and deepfake detection technologies. The last few Key Ideas will aim to highlight some of the present dangers of deepfakes and raise questions and concerns on how the scientific and technologist communities can move forward more conscientiously.
I hope that throughout the 10 Key Ideas on Deepfakes, you will always find something illuminating if you’ve ever been concerned or curious about deepfake voices and/or voice authentication technologies. The first Key Idea is a doozy and is basically a cliffhanger, so tune in every week as we shed light on the dangers and discuss solutions around deepfakes and voice authentication.
Key Idea 1 – the main threat of deepfakes today is that they have defeated the human ear
If no one else has already done so, I will declare 2023 to be the year that AI sold out and went mainstream. It was no longer the darling of cool geeks in the know, eliciting wide-eyed wonder when you told people that you "worked in AI." Impressive, perhaps, but detached from most folks’ daily lives. In 2023 AI had gone truly BIG and wasn't just going to replace you at your job; AI was now going to replace you, period.
Custom TTS (text-to-speech) technology has existed for decades, but only in the past few years has it reached a level of lifelike realism that makes it virtually indistinguishable from real human speech to the human ear.
And, unsurprisingly, we have already seen, or at least heard reports of, abuses of such technologies: the Elon Musk cryptocurrency scam, U.S. election meddling, and the now-classic CEO fraud (another case here).
The public’s fear of lifelike deepfakes grew to a fever pitch throughout 2023 and 2024, and with good reason (AI is coming for your grandma).
So how good are these so-called "humanlike" deepfakes, really?
Last year (2023), researchers from University College London (UCL) published a study on how successful humans would be at detecting whether an audio clip was real or a deepfake. They reported 73% accuracy, achieved with a precision of 68.7%. Put differently, roughly 31% of the clips that listeners flagged as deepfakes were actually genuine human speech; about 1 in 3 "deepfake" verdicts was wrong.
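To make that arithmetic concrete, here is a tiny Python sketch. The confusion-matrix counts below are illustrative numbers I picked to roughly reproduce the reported rates; they are not the UCL study's actual data.

```python
# Illustrative only: a balanced trial of 100 deepfake clips and 100 genuine clips,
# with counts chosen so the headline rates roughly match 73% accuracy / 68.7% precision.
true_positives = 85                      # deepfakes correctly flagged as deepfakes
false_positives = 39                     # genuine clips wrongly flagged as deepfakes
false_negatives = 100 - true_positives   # deepfakes that slipped through
true_negatives = 100 - false_positives   # genuine clips correctly accepted

accuracy = (true_positives + true_negatives) / 200
precision = true_positives / (true_positives + false_positives)
false_discovery_rate = 1 - precision     # share of "deepfake" verdicts that were wrong

print(f"accuracy               : {accuracy:.1%}")              # ~73.0%
print(f"precision              : {precision:.1%}")             # ~68.5%
print(f"wrong 'deepfake' calls : {false_discovery_rate:.1%}")  # ~31.5%, i.e. about 1 in 3
```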
To get an idea of what participants were up against: the deepfakes were likely of similar quality to the VITS samples found on this page, so you can listen for yourself.
Another study, appearing in Nature Communications, expands on this type of analysis by exploring the clues and giveaways that can help us discern deepfakes from reality, but it nonetheless confirms how challenging audio deepfake detection is for human listeners. In one experiment, participants identified audio deepfakes with 72% accuracy, and false positives were only 14% here. In this particular experiment, participants were presented with the audio deepfakes as part of videos, which gave them an advantage. Even so, the audio-deepfake cases were, by a large margin, the hardest for participants to detect compared with other forms of fakery such as voice acting.
In both studies random guessing would have meant an accuracy closer to 50%.
These results were achieved when participants expected to be faced with deepfake audio, and in a trial where the number of deepfakes was about the same as the number of natural speech clips. So human performance was better than random, but far from great.
In the real world, where the use of deepfakes is likely to occur when it’s unexpected, and in far smaller proportions than interactions with real people, how would people fare at noticing if a voice was fake?
The Nature Communications study also aimed to answer that question, with a setup that is perhaps a better simulation of what happens day-to-day in the real world. Participants’ performance at spotting deepfakes was now far worse, and the authors found that audio deepfakes were much harder to spot than video deepfakes. Participants suspected that an audio-only recording was a deepfake only 13% of the time, while still wrongly flagging 8.6% of genuine recordings.
Put differently, humans were fooled by audio deepfakes 87% of the time, while still being (wrongly) suspicious of about 1 in 12 real audio clips.
If you’re not yet convinced that the human ear has already lost the battle against deepfakes, I think we can agree that we’re very close to reaching that point.
Color me Captain Obvious, then, when I say that for the scientists and technologists building custom-voice / voice-clone TTS technologies, fooling the human ear perfectly is the ultimate goal, their holy grail.
Researchers and technologists developing new TTS algorithms and models evaluate their creations each year using the Mean Opinion Score (MOS) metric, based on human judges’ assessments of speech quality. MOS (and its variants) is fundamentally a subjective metric of how lifelike/natural a speech segment sounds to the human ear.
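For readers who haven't sat through one of these evaluations, here is a minimal sketch of how a MOS figure is produced. The listener ratings below are made up purely for illustration; they don't come from any published evaluation.

```python
import statistics

# Each listener rates a synthesized clip on a 1-5 naturalness scale;
# the MOS is simply the mean of those ratings, often reported with a 95% CI.
ratings = [5, 4, 5, 4, 4, 5, 3, 5, 4, 5]   # made-up judgments from 10 listeners

mos = statistics.mean(ratings)
ci_halfwidth = 1.96 * statistics.stdev(ratings) / len(ratings) ** 0.5  # rough 95% CI

print(f"MOS = {mos:.2f} +/- {ci_halfwidth:.2f}")   # e.g. MOS = 4.40 +/- 0.43
```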
Leading systems developed by Microsoft, Meta, Google, as well as others, all boast perfect or near-perfect lifelike performance of their voice clone technologies. The machines have become more real than real!
The above table compiles self-reported subjective MOS scores from VALL-E 2, MELLE, and Voicebox; they are on a scale of 1-5, where 5 means perfect human-level naturalness (or perfect match to the target person), and a 1 is “not at all convincing.”
The variance across reported scores for the same engines reminds us of their subjective nature, so it’s better not to squint too hard at the numbers, but the message is clear: these engines (and many others not covered here) claim perfect or near-perfect naturalness and similarity to the original speakers. Google’s Tacotron was shown to have achieved human-level naturalness as far back as 2019.
What are the implications of this? Are these human-level AI voices running rampant all over the internet and making phone calls to our loved ones or the various institutions we deal with?
On the one hand, Microsoft’s VALL-E and MELLE models remain locked away from the public due to the risk of abuse, and so does Meta’s Voicebox, along with counterparts from some of the other biggest players in AI.
But many other sophisticated voice clone tools are out there and readily accessible by anyone in the public.
Although custom, lifelike TTS voices can be used for legitimate purposes, it’s no surprise that they can also be used by criminals to target or impersonate high-profile individuals for large financial gain.
But what about the rest of us? What level of risk or exposure does the average, non-famous person have, and what can be done to help address any deepfake threats today and in the future?
Regulators across the globe have hurried to post PSAs on the destructive potential of deepfakes at the societal level (also in Europe). And while using AI-generated voices in robocalls has now officially been made illegal, this is unlikely to prevent or deter criminal use of deepfake technology.
And there are other gaps. For instance, defining and establishing in a court of law what is and is not a deepfake are uncharted waters, with all bets placed on researchers and technology providers to figure it out: if polygraph testing trod rough terrain as an effective tool against deception, how much more difficult or complex do things get when it comes to detecting deepfakes? How many "That wasn't me, that's AI-generated" pleas will we begin to see in court cases?
There is a happy path forward, one that is available to us today. It entails technological solutions such as voice authentication and deepfake countermeasures (aka antispoofing technology), along with recommended updates to existing practices and awareness around all of the AI technologies involved.
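To make that pairing concrete, below is a minimal sketch of how such a layered check might look. Everything here is a hypothetical placeholder (the scoring functions, the thresholds, the return messages), not any particular product's API; the point is simply that the deepfake countermeasure and the voice match each get a veto.

```python
# Assumed interfaces (hypothetical):
#   spoof_score(audio)                      -> probability the audio is synthetic or replayed
#   speaker_score(audio, enrolled_voiceprint) -> similarity of the voice to the enrolled speaker
SPOOF_THRESHOLD = 0.50     # assumed operating point, tuned on held-out data
SPEAKER_THRESHOLD = 0.80   # assumed operating point, tuned on held-out data

def authenticate(audio, enrolled_voiceprint, spoof_score, speaker_score) -> str:
    # First gate: the antispoofing countermeasure rejects likely deepfakes outright.
    if spoof_score(audio) >= SPOOF_THRESHOLD:
        return "reject: likely synthetic or replayed audio"
    # Second gate: the voice must actually match the enrolled speaker.
    if speaker_score(audio, enrolled_voiceprint) < SPEAKER_THRESHOLD:
        return "reject: voice does not match the enrolled speaker"
    return "accept"

if __name__ == "__main__":
    # Stand-in scoring functions, purely for illustration.
    dummy_spoof = lambda audio: 0.12                 # countermeasure sees nothing suspicious
    dummy_speaker = lambda audio, voiceprint: 0.91   # voice matches the enrolled speaker
    print(authenticate(b"raw-audio-bytes", "enrolled-voiceprint", dummy_spoof, dummy_speaker))
    # -> accept
```

In practice these scores would come from dedicated antispoofing and speaker-verification models, and the thresholds would be calibrated on real traffic, but the layering is the point.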
tl;dr: studies show that the latest lifelike audio deepfake technology has entirely defeated the human ear, and this is by design. Solutions exist today, and there are practices and habits of awareness we need to update to secure ourselves in the future.
So tune in next week (you can subscribe to this newsletter to get notified as soon as the articles go up), and in the meantime I’d love to hear your thoughts on Key Idea 1 in the comments!
Do you agree that humans can now be easily fooled by a modern-day audio deepfake? Have you or anyone you know been affected by fraud where you suspect a deepfake voice was used?