Key Idea 7 - organizations seeking to test voice authentication and deepfake detection must adopt scientific methods
I have invited Tim Savage and Simone Onizzi to share their insights and guidance for how organizations can approach the testing of voice authentication and/or deepfake detection systems. Tim and Simone have worked with enterprise organizations and their security teams worldwide, advising them on these new practices for voice authentication systems in a post-generative-AI world.
Bite of the Paper Tiger
By Tim Savage and Simone Onizzi
Custom text-to-speech (TTS) technology, which can be used to create lifelike voice “clones” of people, is easily accessible by anyone, no specialized technical skills required.
Throughout this series we’ve talked about how custom TTS technologies can be abused to create deepfakes and perpetuate financial and other crimes. We’ve also talked about how voice authentication and deepfake detection technologies can help detect and resist such fraudulent behaviors. Further, the very same deepfake technologies can (and should) be used to develop voice authentication and deepfake detection in the first place.
But once we have our defense systems ready, how do we go about testing that they actually work?
Journalists have certainly made bypassing voice authentication systems seem pretty easy: sign up for one of the voice clone services, create an “AI” version of your own voice, and call your bank (WARNING: do not try this at home, as you may inadvertently invite security headaches into your life).
Quite the party trick, and one that journalists have been repeating for a few years now. It sounds scary on paper, but such tactics are paper tigers and don't accurately reflect reality. Worse, over-indexing on these journalistic demonstrations may lead us astray as we aim to develop effective defenses against the actual fraud threats (including the use of deepfakes), leaving us more exposed to real attacks.
Effectively testing the risk of voice deepfakes requires a much broader understanding of one’s business policies, operational realities and the underlying technology that empowers them.
Why do organizations want to test their voice authentication solutions?
Businesses test voice authentication software, whether already in production or in the process of being implemented, for the same reason they would test any other new piece of software or physical asset: does my recent or upcoming purchase work as expected?
Project delivery managers need to ensure that the voice authentication solution gives a green light to the right tester and a red light to the wrong one. Operations managers need to verify how authentication decisions impact internal workflows and customer experiences. Executives need to ensure that due diligence has been carried out, validating that their purchase will deliver the expected value (think KPIs in dollars, whether reduced operational costs or increased security) within the expected timeframe (ROI). These are all valid goals and concerns, and proper testing aims to address them.
The rise of audio deepfake technology creates a whole new set of fears and adds another dimension to our voice authentication testing strategy. How do you even begin to structure deepfake testing? There are a multitude of deepfake options available, each of them configurable to varying degrees. As of the writing of this article, there is no universal test set or testing strategy in place for deepfakes.
Functional vs probabilistic testing
So how can you test if your protection against spoof attacks (deepfake or otherwise) is robust enough?
Even in the standard scenario (authenticating a person), voice authentication testing is inherently variable. Unlike traditional password-based systems, where input-output pairs are fixed and fully deterministic (you always expect output y when the input is x), voice authentication systems are built upon probabilistic, AI-based algorithms, and their outputs can vary significantly. This variability is a byproduct of having to adapt to varying real-world factors, including changes in a speaker's health, emotional state, physical environment, the device used to transmit a person's voice, or even the subtle day-to-day variations in people's voices.
The quality of the input to a voice authentication system can also vary dramatically. For instance, our system should be robust to a large amount of background noise while someone is speaking. Does this mean we still “pass” the person, when they’re barely audible? As we work through various scenarios, definitions quickly become murky. But it becomes clear that our testing framework would need to be expanded to cover multiple scenarios and represent a large array of variability, in order to be reliable.
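To make the contrast with deterministic testing concrete, here is a minimal sketch of evaluating a probabilistic authenticator as an acceptance *rate* over many varied trials, rather than a single pass/fail. The `verify` function is a hypothetical stand-in, not any vendor's API; the noise model, score scale, and threshold are assumptions chosen purely for illustration.

```python
import random

def verify(sample: dict) -> float:
    """Hypothetical stand-in for a voice authentication engine.

    Returns a match score in [0, 1]; real engines expose something
    similar, though APIs and score scales vary by vendor.
    """
    base = 0.9 if sample["genuine"] else 0.2
    # Background noise and channel effects push scores around (assumed model).
    score = random.gauss(base - 0.45 * sample["noise"], 0.05)
    return max(0.0, min(1.0, score))

def accept_rate(samples, threshold=0.5):
    """Fraction of samples whose score clears the decision threshold."""
    scores = [verify(s) for s in samples]
    return sum(score >= threshold for score in scores) / len(scores)

random.seed(42)
# Same genuine speaker, many conditions: the outcome is a rate, not a fixed pass/fail.
quiet = [{"genuine": True, "noise": 0.1} for _ in range(500)]
noisy = [{"genuine": True, "noise": 0.9} for _ in range(500)]
print(f"accept rate (quiet): {accept_rate(quiet):.2f}")
print(f"accept rate (noisy): {accept_rate(noisy):.2f}")
```

Under this toy model, the same legitimate speaker who is accepted nearly every time in a quiet room hovers near the decision boundary in heavy noise, which is exactly why a single test call tells you very little.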
These considerations extend to deepfake audio as well.
If voice authentication and deepfake detection technologies cannot be tested like deterministic software, then how do we proceed?
The answer has already been provided to us by the scientific method.
The framework for testing the accuracy of voice authentication and deepfake detection has already been provided by the scientific method for developing these technologies in the first place
In Key Idea 5 we covered the importance of the datasets used to create and develop voice authentication and deepfake detection technologies. Underpinning our recommendations are scientific principles to help us move towards statistical significance and reliability of AI systems.
As it turns out, we can use the very same principles to define our testing methodologies. To recap, the major considerations are:
A large volume of tests, performed by dozens (or hundreds) of participants, will help ensure statistical significance and improve your chances of meaningful interpretation of results
Realism in testing will help avoid certain types of bias that can infiltrate AI systems (by overfitting to the wrong data)
The journalistic demonstrations we’ve spoken about are a prime example: a system can be tuned to account for individuals “breaking in” to their own profiles using deepfakes, but this will likely lead to suboptimal performance when real criminals adopt such techniques to commit fraud.
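As a rough illustration of the statistical-significance point, a confidence interval makes explicit the difference between one journalistic demonstration and hundreds of structured trials. The sketch below uses the standard Wilson score interval; the trial counts are hypothetical.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """95% Wilson score confidence interval for an observed rate.

    Shows how uncertainty around a measured spoof-success rate
    shrinks as the number of test calls grows.
    """
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (max(0.0, center - half), min(1.0, center + half))

# One journalist "breaking in" once tells us almost nothing:
print(wilson_interval(1, 1))    # interval spans most of the range
# Hundreds of structured trials pin the rate down:
print(wilson_interval(5, 500))  # a narrow interval near 1%
```

A single successful break-in is consistent with almost any underlying success rate; only volume narrows the interval to something a business can act on.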
What are we even testing?
So far we’ve spent much of our time talking about the technical (scientific, even) considerations when thinking of testing voice authentication and deepfake detection technologies. We started with that topic to assure readers that there is a path forward. However, the primary dimension for an organization when it comes to evaluating risk around its business and authentication solutions may not have much to do with technology after all.
When organizations are faced with new threats that entail complex and technical testing approaches, they sometimes reach out to their information security colleagues for help. Red Team testing, as traditionally defined, will need to evolve to become meaningful in the context of deepfake voice testing. Penetration testing methodology typically revolves around the following steps:
Let’s now apply the above Red Team framework to the case of authentication within a call center.
1. Methods used to authenticate callers: Knowledge-based questions, one-time-PIN and voice authentication
2. Exploitability of each authentication method:
3. Call into the contact center and access the protected account using the aforementioned exploitations
A Red Team tester synthesizing their own voice or that of a colleague to bypass voice authentication does confirm that the paper tiger has some teeth. Such a test, however, would be prone to the same limitations as we mentioned earlier from a statistical significance and bias standpoint. Beyond the science of such testing, however, it is also limited because:
Before we conclude, there is one other dimension for us to consider, perhaps not today, but certainly in a future not too far away: synthetic speech is not inherently fraudulent or malicious. Detecting synthetic speech in the context of customer interactions can increase operational costs, frustrate legitimate customers and increase fraud exposure.
Consider a customer who has set up an auto-attendant using a clone of their own voice to hold their place in a call queue with their bank. The first 5-10 seconds of the call with the bank would be a synthesized voice saying something like “This is an AI agent calling on behalf of John Smith; please wait a moment while I get John to pick up.” When deepfake detection technology correctly flags the synthesized voice, what are the operational steps thereafter? Does a fraud analyst take the call (increased operational costs), do you force the customer to use knowledge-based authentication (increased fraud exposure), or do you deny service altogether (customer frustration)?
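One way to reason about those operational steps is to write the routing policy down explicitly. The sketch below is purely illustrative: the thresholds, the signal names (`synthetic_score`, `declared_ai_agent`), and the routes are assumptions for discussion, not a recommendation for any particular product.

```python
from dataclasses import dataclass
from enum import Enum

class Route(Enum):
    CONTINUE = "continue normal flow"
    FRAUD_ANALYST = "route to fraud analyst"
    STEP_UP_AUTH = "require an additional authentication factor"
    DENY = "deny service"

@dataclass
class CallSignal:
    synthetic_score: float   # deepfake detector output, 0..1 (assumed scale)
    seconds_elapsed: float   # how far into the call the flag fired
    declared_ai_agent: bool  # caller's opening announced an AI assistant

def route_call(signal: CallSignal) -> Route:
    """Illustrative policy: a benign AI auto-attendant announcing itself
    in the opening seconds need not trigger the same response as
    synthetic speech detected mid-authentication."""
    if signal.declared_ai_agent and signal.seconds_elapsed <= 10:
        return Route.CONTINUE       # wait for the human to pick up, re-check then
    if signal.synthetic_score >= 0.95:
        return Route.FRAUD_ANALYST  # high-confidence deepfake during the call
    if signal.synthetic_score >= 0.7:
        return Route.STEP_UP_AUTH   # ambiguous: add another factor
    return Route.CONTINUE

print(route_call(CallSignal(0.98, 4.0, True)))    # announced auto-attendant
print(route_call(CallSignal(0.98, 45.0, False)))  # synthetic voice mid-call
```

The point is not these particular thresholds, but that each branch carries a cost from the paragraph above, and the business, not the detector, decides the trade-off.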
Voice authentication solutions should not be viewed as a wall where pass and fail are the only outcomes. They are better understood as a highly configurable membrane, where the size and quantity of what passes through is directly controlled by business objectives. Demonstrating that a needle can be threaded through a fishing net (voice authentication) does not qualify or quantify how much fish (fraud) you are going to catch.
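The membrane metaphor can be made concrete with a threshold sweep. The sketch below uses synthetic, normally distributed match scores (an assumption, not real engine output) to show how moving a single decision threshold trades false accepts against false rejects.

```python
import random

def far_frr(genuine_scores, impostor_scores, threshold):
    """False-accept and false-reject rates at a given decision threshold."""
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    return far, frr

random.seed(7)
# Synthetic score distributions standing in for a real engine's output.
genuine = [random.gauss(0.80, 0.08) for _ in range(2000)]
impostor = [random.gauss(0.35, 0.10) for _ in range(2000)]

for t in (0.45, 0.55, 0.65):
    far, frr = far_frr(genuine, impostor, t)
    print(f"threshold={t:.2f}  FAR={far:.3f}  FRR={frr:.3f}")
```

Tightening the membrane (raising the threshold) shrinks the false-accept rate while growing the false-reject rate; where to sit on that curve is a business decision, not a property of the technology alone.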
Fraudsters leveraging deepfake voice technology represent a very small portion of the voice authentication threat landscape. A fraudster’s goal when attacking a contact center is not to show off their audio manipulation prowess, but to gain access to an account as easily as possible. Traditional penetration testing exposes vulnerabilities in a system or network; in the context of voice authentication, we propose that the question should not mainly be whether deepfake technology can be used to obtain privileged access, but more generally how voice (or any other method) can be used to obtain privileged access in a given interaction channel (such as a phone call).
tl;dr: just as when developing voice authentication and deepfake detection, you need realistic, rich and abundant data when testing the accuracy of these technologies. Thankfully, the scientific method already exists, and it should be incorporated into testing methodologies.
Authentic Leadership | Strategic Transformation | Financial Services | Risk Management | Customer Contact | Biometric Security
I'd also point out that it is human nature to assume that something is working (or not) based on your own experience of it. A restaurant is bad [or good] because you had an overcooked [perfect] steak; a bank is bad [or good] because the credit line they offered was lower [higher] than your current provider's; a delivery firm is bad [or good] because the driver dropped your parcel on your cat [handed it to you personally]. The temptation can be similar when testing probabilistic solutions. Do you allow yourself to be "turned on a pin" because of an observation from a targeted test case? It is equally dangerous to take false comfort from a result that you like as it is to over-inflate a worry from one that you don't. As an aside: I recall, early in my career, an angry executive demanding changes to retail credit policy because their spouse didn't get the credit limit that they wanted. Exceptions and edge cases are great ways to inform workarounds, mitigants and exceptions, but it's no way to set policy! Suffice to say, we awarded the credit limit, but we didn't change the policy.
A great read! I love the needle & fish net analogy & use it regularly. I fully get the desire to test. As organisations face new threats & deploy novel solutions to mitigate them, they must gain buy-in from many (often skeptical) actors: customers, colleagues, executives, regulators ... Model governance frameworks, in particular post financial crisis, almost compel us to test everything to destruction. The most valuable skill is the ability to design & execute meaningful test plans & to be able to comprehend the results that you generate in the context of your deployed solution. Your piece carries some sage advice! If I had just two key things for test planners to remember: A precisely designed, pinpoint, edge-case test can be highly informative in illustrating the limitations of your solution, but it is almost certainly not representative of the lion's share of the use cases you will see ... or even where its value lies. Don't conflate the two. Particularly when dealing with solutions to mitigate novel threats, don't allow unrealistic expectations of perfection to be the enemy of "good". Furthermore, don't dare to assume that you know what perfect looks like - that is where complacency lies!
BONUS CONTENT: for me, the Holidays are usually a great time to get caught up on personal projects, sleep, hobbies, and READING (also a hobby, yes). Somewhat relevant to today's article, my book recommendation emerging from this latest Holiday season is "Everything is Predictable" by Tom Chivers: https://www.goodreads.com/book/show/199798096-everything-is-predictable A really nice, accessible intro to statistical thinking and methods, with some nice history to boot!