Key Idea 7 - organizations seeking to test voice authentication and deepfake detection must adopt scientific methods
I have invited Tim Savage and Simone Onizzi to share their insights and guidance for how organizations can approach the testing of voice authentication and/or deepfake detection systems. Tim and Simone have worked with enterprise organizations and their security teams worldwide, advising them on these new practices for voice authentication systems in a post-generative-AI world.
Bite of the Paper Tiger
By Tim Savage and Simone Onizzi
Custom text-to-speech (TTS) technology, which can be used to create lifelike voice “clones” of people, is easily accessible by anyone, no specialized technical skills required.
Throughout this series we’ve talked about how custom TTS technologies can be abused to create deepfakes and perpetuate financial and other crimes. We’ve also talked about how voice authentication and deepfake detection technologies can help detect and resist such fraudulent behaviors. Further, the very same deepfake technologies can (and should) be used to develop voice authentication and deepfake detection in the first place.
But once we have our defense systems ready, how do we go about testing that they actually work?
Journalists have certainly made bypassing voice authentication systems seem pretty easy: sign up for one of the voice clone services, create an “AI” version of your own voice, and call your bank (WARNING: do not try this at home, as you may inadvertently invite security headaches into your life).
Quite the party trick, and one that journalists have been repeating for a few years now. It sounds scary on paper, but such tactics are paper tigers and don't accurately reflect reality. Worse, over-indexing on these journalistic demonstrations may lead us astray as we aim to develop effective defenses against the actual fraud threats (including the use of deepfakes), leaving us more exposed to real attacks.
Effectively testing the risk of voice deepfakes requires a much broader understanding of one’s business policies, operational realities and the underlying technology that empowers them.
Why do organizations want to test their voice authentication solutions?
Businesses test voice authentication software, whether already in production or in the process of being implemented, for the same reason they would test any other new piece of software or physical asset: does my recent or upcoming purchase work as expected?
Project delivery managers need to ensure that the voice authentication solution gives a green light to the right tester and a red light to the wrong one. Operations managers need to verify how authentication decisions impact internal workflows and customer experiences. Executives need to ensure that due diligence has been carried out, validating that their purchase will deliver the expected value (think KPIs in dollars, whether reduced operational costs or increased security) within the expected timeframe (ROI). These are all valid goals and concerns, and proper testing aims to address them.
The rise of audio deepfake technology creates a whole new set of fears and adds another dimension to our voice authentication testing strategy. How do you even begin to structure deepfake testing? There are a multitude of deepfake options available, each of them configurable to varying degrees. As of the writing of this article, there is no universal test set or testing strategy in place for deepfakes.
Functional vs probabilistic testing
So how can you test if your protection against spoof attacks (deepfake or otherwise) is robust enough?
Even in the standard scenario (authenticating a person), voice authentication testing is inherently variable. Unlike traditional password-based systems, where input-output pairs are fixed and fully deterministic (you always expect output y when the input is x), voice authentication systems are built upon probabilistic, AI-based algorithms, and their outputs can vary significantly. This variability is a byproduct of having to adapt to varying real-world factors, including changes in a speaker's health, emotional state, physical environment, the device used to transmit a person's voice, or even the subtle day-to-day variations in people's voices.
The quality of the input to a voice authentication system can also vary dramatically. For instance, our system should be robust to a large amount of background noise while someone is speaking. Does this mean we still “pass” the person, when they’re barely audible? As we work through various scenarios, definitions quickly become murky. But it becomes clear that our testing framework would need to be expanded to cover multiple scenarios and represent a large array of variability, in order to be reliable.
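To make the contrast with deterministic testing concrete, here is a minimal sketch of evaluating a probabilistic authenticator as an acceptance *rate* over many varied trials, rather than a single pass/fail. The `verify` function is a hypothetical stand-in, not any vendor's API; the noise model, score scale, and threshold are assumptions chosen purely for illustration.

```python
import random

def verify(sample: dict) -> float:
    """Hypothetical stand-in for a voice authentication engine.

    Returns a match score in [0, 1]; real engines expose something
    similar, though APIs and score scales vary by vendor.
    """
    base = 0.9 if sample["genuine"] else 0.2
    # Background noise and channel effects push scores around (assumed model).
    score = random.gauss(base - 0.45 * sample["noise"], 0.05)
    return max(0.0, min(1.0, score))

def accept_rate(samples, threshold=0.5):
    """Fraction of samples whose score clears the decision threshold."""
    scores = [verify(s) for s in samples]
    return sum(score >= threshold for score in scores) / len(scores)

random.seed(42)
# Same genuine speaker, many conditions: the outcome is a rate, not a fixed pass/fail.
quiet = [{"genuine": True, "noise": 0.1} for _ in range(500)]
noisy = [{"genuine": True, "noise": 0.9} for _ in range(500)]
print(f"accept rate (quiet): {accept_rate(quiet):.2f}")
print(f"accept rate (noisy): {accept_rate(noisy):.2f}")
```

Under this toy model, the same legitimate speaker who is accepted nearly every time in a quiet room hovers near the decision boundary in heavy noise, which is exactly why a single test call tells you very little.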
These considerations extend to deepfake audio as well.
If voice authentication and deepfake detection technologies cannot be tested like deterministic software, then how do we proceed?
The answer has already been provided to us by the scientific method.
The framework for testing the accuracy of voice authentication and deepfake detection has already been provided by the scientific method for developing these technologies in the first place
In Key Idea 5 we covered the importance of the datasets used to create and develop voice authentication and deepfake detection technologies. Underpinning our recommendations are scientific principles to help us move towards statistical significance and reliability of AI systems.
As it turns out, we can use the very same principles to define our testing methodologies. To recap, the major considerations are:
A large volume of tests, performed by dozens (or hundreds) of participants, will help ensure statistical significance and improve your chances of meaningful interpretation of results
Realism in testing will help avoid certain types of bias that can infiltrate AI systems (by overfitting to the wrong data)
The journalistic demonstrations we’ve spoken about are a prime example: a system can be tuned to account for individuals “breaking in” to their own profiles using deepfakes, but this will likely lead to suboptimal performance when real criminals adopt such techniques to commit fraud.
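As a rough illustration of the statistical-significance point, a confidence interval makes explicit the difference between one journalistic demonstration and hundreds of structured trials. The sketch below uses the standard Wilson score interval; the trial counts are hypothetical.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """95% Wilson score confidence interval for an observed rate.

    Shows how uncertainty around a measured spoof-success rate
    shrinks as the number of test calls grows.
    """
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (max(0.0, center - half), min(1.0, center + half))

# One journalist "breaking in" once tells us almost nothing:
print(wilson_interval(1, 1))    # interval spans most of the range
# Hundreds of structured trials pin the rate down:
print(wilson_interval(5, 500))  # a narrow interval near 1%
```

A single successful break-in is consistent with almost any underlying success rate; only volume narrows the interval to something a business can act on.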
What are we even testing?
So far we’ve spent much of our time talking about the technical (scientific, even) considerations when thinking of testing voice authentication and deepfake detection technologies. We started with that topic to assure readers that there is a path forward. However, the primary dimension for an organization when it comes to evaluating risk around its business and authentication solutions may not have much to do with technology after all.
When organizations are faced with new threats that entail complex and technical testing approaches, they sometimes reach out to their information security colleagues for help. Red Team testing, as traditionally defined, will need to evolve to become meaningful in the context of deepfake voice testing. Penetration testing methodology typically revolves around the following steps:
Let’s now apply the above Red Team framework to the case of authentication within a call center.
1. Methods used to authenticate callers: Knowledge-based questions, one-time-PIN and voice authentication
2. Exploitability of each authentication method:
3. Call into the contact center and access the protected account using the aforementioned exploitations
A Red Team tester synthesizing their own voice or that of a colleague to bypass voice authentication does confirm that the paper tiger has some teeth. Such a test, however, would be prone to the same limitations as we mentioned earlier from a statistical significance and bias standpoint. Beyond the science of such testing, however, it is also limited because:
Before we conclude, there is one other dimension for us to consider, perhaps not today, but certainly in a future not too far away: synthetic speech is not inherently fraudulent or malicious. Detecting synthetic speech in the context of customer interactions can increase operational costs, frustrate legitimate customers and increase fraud exposure.
Consider a customer who has set up an auto-attendant using a clone of their own voice to hold their place in a call queue with their bank. The first 5-10 seconds of the call with the bank would be a synthesized voice saying something like “This is an AI agent calling on behalf of John Smith; please wait a moment while I get John to pick up.” When deepfake detection technology correctly flags the synthesized voice, what are the operational steps thereafter? Does a fraud analyst take the call (increased operational costs), do you force the customer to use knowledge-based authentication (increased fraud exposure), or do you deny service altogether (customer frustration)?
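One way to reason about those operational steps is to write the routing policy down explicitly. The sketch below is purely illustrative: the thresholds, the signal names (`synthetic_score`, `declared_ai_agent`), and the routes are assumptions for discussion, not a recommendation for any particular product.

```python
from dataclasses import dataclass
from enum import Enum

class Route(Enum):
    CONTINUE = "continue normal flow"
    FRAUD_ANALYST = "route to fraud analyst"
    STEP_UP_AUTH = "require an additional authentication factor"
    DENY = "deny service"

@dataclass
class CallSignal:
    synthetic_score: float   # deepfake detector output, 0..1 (assumed scale)
    seconds_elapsed: float   # how far into the call the flag fired
    declared_ai_agent: bool  # caller's opening announced an AI assistant

def route_call(signal: CallSignal) -> Route:
    """Illustrative policy: a benign AI auto-attendant announcing itself
    in the opening seconds need not trigger the same response as
    synthetic speech detected mid-authentication."""
    if signal.declared_ai_agent and signal.seconds_elapsed <= 10:
        return Route.CONTINUE       # wait for the human to pick up, re-check then
    if signal.synthetic_score >= 0.95:
        return Route.FRAUD_ANALYST  # high-confidence deepfake during the call
    if signal.synthetic_score >= 0.7:
        return Route.STEP_UP_AUTH   # ambiguous: add another factor
    return Route.CONTINUE

print(route_call(CallSignal(0.98, 4.0, True)))    # announced auto-attendant
print(route_call(CallSignal(0.98, 45.0, False)))  # synthetic voice mid-call
```

The point is not these particular thresholds, but that each branch carries a cost from the paragraph above, and the business, not the detector, decides the trade-off.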
Voice authentication solutions should not be viewed as a wall where pass and fail are the only outcomes. They are better understood as a highly configurable membrane, where the size and quantity of what passes through is directly controlled by business objectives. Demonstrating that a needle can be threaded through a fishing net (voice authentication) does not qualify or quantify how much fish (fraud) you are going to catch.
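The membrane metaphor can be made concrete with a threshold sweep. The sketch below uses synthetic, normally distributed match scores (an assumption, not real engine output) to show how moving a single decision threshold trades false accepts against false rejects.

```python
import random

def far_frr(genuine_scores, impostor_scores, threshold):
    """False-accept and false-reject rates at a given decision threshold."""
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    return far, frr

random.seed(7)
# Synthetic score distributions standing in for a real engine's output.
genuine = [random.gauss(0.80, 0.08) for _ in range(2000)]
impostor = [random.gauss(0.35, 0.10) for _ in range(2000)]

for t in (0.45, 0.55, 0.65):
    far, frr = far_frr(genuine, impostor, t)
    print(f"threshold={t:.2f}  FAR={far:.3f}  FRR={frr:.3f}")
```

Tightening the membrane (raising the threshold) shrinks the false-accept rate while growing the false-reject rate; where to sit on that curve is a business decision, not a property of the technology alone.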
Fraudsters leveraging deepfake voice technology represent a very small portion of the voice authentication threat landscape. A fraudster’s goal when attacking a contact center is not to show off their audio manipulation prowess, but to gain access to an account as easily as possible. Traditional penetration testing exposes vulnerabilities in a system or network; in the context of voice authentication, we propose that the question should not mainly be whether deepfake technology can be used to obtain privileged access, but more generally how voice (or any other method) can be used to obtain privileged access in a given interaction channel (such as a phone call).
tl;dr: just as when developing voice authentication and deepfake detection, you need realistic, rich and abundant data when testing the accuracy of these technologies. Thankfully, the scientific method already exists, and it should be incorporated into testing methodologies.
Authentic Leadership | Strategic Transformation | Financial Services | Risk Management | Customer Contact | Biometric Security
I'd also point out that it is human nature to assume that something is working (or not) based on your own experience of it. A restaurant is bad [or good] because you had an overcooked [perfect] steak; a bank is bad [or good] because the credit line they offered was lower [higher] than your current provider's; a delivery firm is bad [or good] because the driver dropped your parcel on your cat [handed it to you personally]. The temptation can be similar when testing probabilistic solutions. Do you allow yourself to be "turned on a pin" because of an observation from a targeted test case? It is equally dangerous to take false comfort from a result that you like as it is to over-inflate a worry from one that you don't. As an aside: I recall, early in my career, an angry executive demanding changes to retail credit policy because their spouse didn't get the credit limit that they wanted. Exceptions and edge cases are great ways to inform workarounds, mitigants and exceptions, but it's no way to set policy! Suffice to say, we awarded the credit limit, but we didn't change the policy.
A great read! I love the needle & fish net analogy & use it regularly. I fully get the desire to test. As organisations face new threats & deploy novel solutions to mitigate them, they must gain buy-in from many (often skeptical) actors: customers, colleagues, executives, regulators ... Model governance frameworks, in particular post financial crisis, almost compel us to test everything to destruction. The most valuable skill is the ability to design & execute meaningful test plans & to be able to comprehend the results that you generate in the context of your deployed solution. Your piece carries some sage advice! If I had just two key things for test planners to remember: A precisely designed, pinpoint, edge-case test can be highly informative in illustrating the limitations of your solution, but it is almost certainly not representative of the lion's share of the use cases you will see ... or even where its value lies. Don't conflate the two. Particularly when dealing with solutions to mitigate novel threats, don't allow unrealistic expectations of perfection to be the enemy of "good". Furthermore, don't dare to assume that you know what perfect looks like - that is where complacency lies!
BONUS CONTENT: for me, the Holidays are usually a great time to get caught up on personal projects, sleep, hobbies, and READING (also a hobby, yes). Somewhat relevant to today's article, my book recommendation emerging from this latest Holiday season is "Everything is Predictable" by Tom Chivers: https://www.goodreads.com/book/show/199798096-everything-is-predictable A really nice, accessible intro to statistical thinking and methods, with some nice history to boot!