Key Idea 3 – voice authentication is the first line of defense against audio deepfakes

In the world of voice biometric authentication (I will use "voice authentication" interchangeably to mean the same thing), the hypothetical threat of deepfake/synthetic-voice fraud has been well known for over a decade. But fear and loathing of deepfakes broke through to the public in 2023, with coverage jumping quickly from demonstrations of hyperreal artificial voices to doubts about the reliability of voice authentication. We can choose this article published online by Vice UK as the symbolic beginning.

Based on the Vice article and several copycats that followed (Guardian Australia, Wall Street Journal), one might conclude that the existence of deepfakes has made voice authentication obsolete. For now, we will set aside the problematic and biased nature of such demonstrations, and instead spend our time explaining why the opposite conclusion is true.

As we take a more systematic approach to evaluating the vulnerability of a voice authentication system, we will come to realize that if we are indeed worried about the threat of audio deepfakes, then voice authentication is a powerful tool to shield us from them.

To set the context, a modern voice authentication system will typically produce a score evaluating whether the voice in a given audio clip matches the voice of the person the speaker claims to be. When people call regarding their own bank accounts, mobile service, etc., the scores will generally be high. And when fraudsters (using their own, undisguised voices) phone a call center to access the accounts of their victims, a high-quality voice authentication system will return low scores.

In this example, Jane is speaking with her bank’s call center representative, who is able to verify that this is indeed the real Jane using voice authentication
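For intuition, here is a minimal sketch of how such a match score might be produced, assuming the common embedding-comparison approach; the embedding size, the random stand-in data, and the function name are illustrative assumptions, not any particular vendor's system:

```python
import numpy as np

def cosine_score(enrolled: np.ndarray, test: np.ndarray) -> float:
    """Compare two fixed-length voice embeddings; higher means more similar."""
    a = enrolled / np.linalg.norm(enrolled)
    b = test / np.linalg.norm(test)
    return float(np.dot(a, b))

# In a real system, both embeddings would come from a speaker-embedding
# network (e.g. a ResNet) applied to enrollment audio and to the new call.
rng = np.random.default_rng(0)
jane_profile = rng.standard_normal(256)                 # stand-in for Jane's enrolled voiceprint
caller = jane_profile + 0.3 * rng.standard_normal(256)  # stand-in for a new call from Jane
print(f"match score: {cosine_score(jane_profile, caller):.3f}")
```

In practice, raw similarity scores are usually calibrated before thresholding, which is why the scores in the graphs below are not confined to the [-1, 1] range of a cosine.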

For a given system, you would normally choose a cutoff point, or threshold, for these scores, on which to base automatic decisions about whether each caller is, in fact, who they claim to be (and to refuse access to anyone else). The graphic below illustrates this concept.

The range of voice authentication scores for different types of individuals. Fraudsters (purple) will tend to score lower against the voice profiles of legitimate account holders (green). All cases appearing to the left of the threshold (dotted line) are rejected, whereas everything to the right will be deemed authentic

One can see that each type of person (legitimate account holder or fraudster) has its own color-coded set of scores, which are not quite Gaussian/bell curves, but close. This graph shows the voice authentication performance, on a real-world dataset, of a modern system based on a common flavor of Deep Neural Network known as a ResNet. In this case, we see a very clean separation between fraudsters and legitimate speakers, so choosing a threshold between the two (at a score of around 3.9) is straightforward enough; this allows us to authenticate 98% of legitimate persons while rejecting fraud attempts 99.9% of the time.
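As a toy illustration of how those two percentages fall out of a threshold, here is a short sketch using simulated scores; the means and spreads are made up to loosely mimic the graph, not taken from the real dataset:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated score distributions (illustrative values only).
legit_scores = rng.normal(loc=7.0, scale=1.5, size=10_000)  # green curve
fraud_scores = rng.normal(loc=0.0, scale=1.3, size=10_000)  # purple curve

threshold = 3.9
legit_accepted = np.mean(legit_scores >= threshold)  # legitimate callers authenticated
fraud_rejected = np.mean(fraud_scores < threshold)   # fraudsters refused

print(f"legitimate accepted: {legit_accepted:.1%}")
print(f"fraud rejected:      {fraud_rejected:.1%}")
```

With these made-up distributions, the printed rates land in the same ballpark as the 98% / 99.9% figures above; the exact numbers depend entirely on how well separated the two score distributions are.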

Recalling the internet headlines that would have us believe deepfakes could now fully circumvent voice authentication, let’s take a closer look at what’s happening behind the scenes. If we add the scores generated by deepfakes and other spoof attacks (e.g. recordings of the victim’s voice), it could look like the following graph.

The voice authentication scores for spoof audio (includes deepfakes and other technology-based impersonation methods) are now added, in orange, to the previously shown graph

We now see an example of how a fairly modern voice authentication system performs against spoof attacks. The spoof attacks of this particular experiment spanned several deepfake technologies, some of higher quality than others, in a real-world setting.

This voice authentication system was able to deter over 80% of attempted spoof attacks. On the one hand, the result shown here does not guarantee this level of performance for all systems, and it is likely that the highest-quality deepfake attacks would have more success than lower-quality ones. On the other hand, by moving our threshold further to the right, we could increase the system's robustness to deepfakes (at the cost of inconveniencing a greater portion of the legitimate population).
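That trade-off is easy to see with a small threshold sweep. Continuing the simulated-score sketch from above (the assumption that spoof scores sit between fraudsters and legitimate speakers is made purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
legit_scores = rng.normal(7.0, 1.5, 10_000)  # legitimate account holders
spoof_scores = rng.normal(2.0, 2.0, 10_000)  # deepfakes/replays: above plain fraud, below legit

for threshold in (3.0, 3.9, 5.0, 6.0):
    legit_accepted = np.mean(legit_scores >= threshold)
    spoofs_deterred = np.mean(spoof_scores < threshold)
    print(f"threshold {threshold:3.1f}: legit accepted {legit_accepted:5.1%}, "
          f"spoofs deterred {spoofs_deterred:5.1%}")
```

Raising the threshold pushes spoof deterrence up and legitimate acceptance down; where to sit on that curve is a business decision as much as a technical one.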

Regardless, the message is clear: voice authentication's resistance to deepfakes is far better than what humans are able to achieve, even when they are told to expect deepfakes in the audio clips (see Key Idea 1). A recent paper published at Interspeech draws similar conclusions.

Evolution of voice authentication error over time. Authentication accuracy (purple) shown with respect to key voice authentication technologies, ordered from oldest to newest (left to right). Also shown is the error with respect to deepfakes (orange) for the same technologies. Source: Jung et al., Interspeech 2024

As shown in the above graphic, authentication error improves over time as new types of systems are developed. Note the term EER (equal error rate), which is a common metric to assess biometric system accuracy (some general info here); intuitively, we want to minimize this type of error (zero would be nice).
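For readers who want to compute it, EER is simply the operating point where the false acceptance rate (impostors accepted) equals the false rejection rate (genuine speakers rejected). A minimal sketch, using simulated scores as before:

```python
import numpy as np

def equal_error_rate(genuine: np.ndarray, impostor: np.ndarray) -> float:
    """Return the error rate at the threshold where FAR and FRR cross."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([np.mean(impostor >= t) for t in thresholds])  # false accepts
    frr = np.array([np.mean(genuine < t) for t in thresholds])    # false rejects
    i = np.argmin(np.abs(far - frr))  # closest crossing point
    return float((far[i] + frr[i]) / 2)

rng = np.random.default_rng(3)
genuine = rng.normal(7.0, 1.5, 5_000)
impostor = rng.normal(0.0, 1.3, 5_000)
print(f"EER: {equal_error_rate(genuine, impostor):.2%}")
```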

While older voice authentication systems show some resilience to deepfakes, the error rates are still quite high, whereas their modern counterparts showcase much improved robustness to such attacks.

To situate ourselves in time: the i-vector was introduced in 2011, the x-vector in 2018, and WavLM-ECAPA is a contemporary (2024) system.

Voice authentication clearly provides a substantial level of defense against deepfakes, but it must be kept up to date. Most importantly, research into new voice authentication models must continue, with robustness to audio deepfakes treated as an explicit objective alongside authentication accuracy.

Though we must maintain a frantic pace to keep voice authentication systems up to date, advances in lifelike voice synthesis have so far posed only a gradually rising threat to these systems. You can actually read this from the graph above, by following the orange dotted line from right to left.

As we walk backwards in time, we see that the effectiveness of modern deepfake technology against older authentication technology trends upward, but only gradually until we reach the x-vector; from there, the error leaps dramatically as we move back to i-vector technology.

Fun fact: the Deep Neural Network (DNN) revolution first came to the world of voice biometrics in the form of x-vector technology (seminal paper here), and had a massive, positive impact on voice biometrics accuracy (and resilience to deepfakes).

So, with the exception of when DNNs first emerged, benefitting both voice biometrics and deepfakes in equal measure, subsequent deepfake advances have only gradually improved in efficacy against older voice authentication systems.

Returning to the first results we shared in this article, now expanding the view to how older systems perform against modern-day deepfake attacks, we notice a similar pattern to that of the Interspeech paper.

Our own analysis of top voice authentication technology of the past few years, on our own real-world dataset. We notice a similar pattern to that of the Interspeech paper: newer systems outperform older ones when it comes to deepfakes

While slightly older systems can provide a strong baseline resistance to deepfakes (and other spoof attacks), newer systems are far more resilient (compare 70% resilience in 2019 vs 80%+ in 2024).

From a fraudster's perspective, however, based on what we've now seen, the likelihood of circumventing a voice authentication system does appear to have increased: from around 0.1% using their own voice, to around 20% if they were to start using deepfakes.

But this also means that voice authentication alone can help prevent 80% of audio deepfake attacks, whereas humans are largely unable to detect deepfakes (see Key Idea 1).

Consequently, voice authentication becomes one of the few key tools to help us defend against audio deepfakes. It's clear that deepfakes pose a challenge to voice authentication systems, but for the time being, and through several AI revolutions over the past few years, the technology has held relatively strong.

So if we are truly concerned about how much audio deepfakes could be exploited by fraudsters to commit identity theft, financial crimes or other types of crimes, then we have to seriously consider, if not outright rush to, using voice authentication in more areas to secure sensitive interactions.

As long as the quality of deepfakes and other spoof-related technologies continues to improve, it could be dangerous in the long-term to rely solely on voice authentication technology as a deterrent. Therefore, next week, we will take a closer look at the deepfake countermeasures available to us today; we will see how these technologies complement voice authentication’s ability to resist (and even detect) deepfake attacks.

tl;dr: perhaps counterintuitively, a modern voice authentication system can help mount a powerful defense against deepfake (and other spoof) attacks, deterring 80% of them.

I've attempted to share some important ideas on voice authentication's role in an increasingly deepfake world, ideas which run contrary to some headlines of the past couple years. But do you agree with the analysis and conclusions? Please let me know in the comments below!


I would again like to thank some folks on my team for their feedback through the course of writing today's article, Luis Buera and Simone Onizzi. And a special thanks to Héctor Delgado for also creating some of the graphs.
