Key Idea 6 - build synthetic speech (deepfake) technology responsibly
We started this series by outlining how audio deepfakes have already defeated the human ear. By design.
This is confirmed by the commonly used MOS (Mean Opinion Score) family of metrics, which are calculated from human listener ratings of how lifelike, or how similar to their human counterparts, the deepfake voices sound.
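For readers unfamiliar with how these scores are produced, here is a minimal sketch of a MOS calculation: listeners rate each utterance (typically on a 1-5 scale) and the ratings are averaged. The ratings below are invented purely for illustration.

```python
# Minimal sketch of how a MOS (Mean Opinion Score) is computed from
# listener ratings; the ratings below are made-up illustration values.
import statistics

# Hypothetical 1-5 naturalness ratings from human listeners for one
# synthesized utterance.
ratings = [4, 5, 4, 3, 5, 4, 4, 5]

mos = statistics.mean(ratings)                                  # the MOS itself
ci95 = 1.96 * statistics.stdev(ratings) / len(ratings) ** 0.5   # rough 95% CI

print(f"MOS = {mos:.2f} +/- {ci95:.2f}")
```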
AI safety and responsibility
This theme was present in a big way at this year's NeurIPS, and I took away two major lessons from what I saw:
So perhaps it's more important for each AI practitioner to adopt AI safety and responsibility as a personal philosophy for their own work. I think this would lead to much better outcomes than mechanically going through a checklist of "AI safety" items. I won't make any broad statements or pontificate here; rather, I'd like to open a conversation with anyone who develops generative technologies, particularly in the voice space, where the goal is to artificially clone a specific person's voice.
That being said, we have three recommendations for researchers and technologists who create and deploy synthetic speech technology for legitimate use cases, and who would not want their technology to be used for malicious purposes.
1. Don't use speaker embeddings as part of synthetic speech technology
It's one thing to create a synthetic voice that sounds lifelike, but there's an added dimension of complexity when you want that synthetic voice to sound like a specific person. Whether it's for custom text-to-speech (TTS) or voice conversion technology, the speaker embedding for the target person is used as part of the audio deepfake creation process. This is illustrated in the next two images.
Put simply, a speaker embedding is a compact representation of a person's voice, and is the same type of representation used by voice authentication systems.
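As a very simplified illustration of how such an embedding is used by a voice authentication system, here is a sketch that compares a stored enrolment embedding against the embedding of the incoming audio using cosine similarity. The extract_embedding call mentioned in the comments is a placeholder rather than a real API, and the threshold and random vectors are purely illustrative.

```python
# Sketch of how a voice authentication system compares speaker embeddings.
# extract_embedding() is a placeholder for whatever speaker encoder is used;
# it is not a real library call.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two fixed-length speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# In practice these would come from a speaker encoder applied to audio,
# e.g. emb = extract_embedding("enrolment.wav"); here we use random vectors.
rng = np.random.default_rng(0)
enrolled_embedding = rng.normal(size=256)   # stored at enrolment time
probe_embedding = rng.normal(size=256)      # computed from the incoming call

THRESHOLD = 0.7  # illustrative decision threshold, tuned per system
score = cosine_similarity(enrolled_embedding, probe_embedding)
print("accept" if score >= THRESHOLD else "reject", f"(score={score:.2f})")
```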
Taking this further, several such systems are developed adversarially to voice authentication, using a neural network architecture known as a Generative Adversarial Network (GAN). The GAN concept involves two opposing, or "adversarial", networks: one (the generator) tries to create the most realistic synthetic audio possible, while the other (the discriminator) tries to tell real audio apart from synthetic audio. The two networks are trained in alternation until the feedback loop stabilizes, at which point you "throw away" the discriminator and keep only the generator.
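For readers who have not seen a GAN before, the sketch below shows the alternating generator/discriminator updates in their most stripped-down form (using PyTorch). It is not how any real voice-conversion system is built; real systems condition on speech features, speaker representations and much more, and the sizes here are toy values.

```python
# Highly simplified GAN training loop, only to illustrate the generator /
# discriminator interplay described above.
import torch
import torch.nn as nn

latent_dim, audio_dim = 64, 1024  # toy sizes, not from any real system

generator = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                          nn.Linear(256, audio_dim), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(audio_dim, 256), nn.LeakyReLU(0.2),
                              nn.Linear(256, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def training_step(real_audio: torch.Tensor) -> None:
    batch = real_audio.size(0)
    z = torch.randn(batch, latent_dim)

    # 1) Discriminator tries to separate real audio from generated audio.
    fake_audio = generator(z).detach()
    d_loss = (bce(discriminator(real_audio), torch.ones(batch, 1)) +
              bce(discriminator(fake_audio), torch.zeros(batch, 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # 2) Generator tries to make the discriminator label its output as real.
    g_loss = bce(discriminator(generator(z)), torch.ones(batch, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

# After training stabilizes, only `generator` is kept.
training_step(torch.randn(8, audio_dim))  # one step on dummy "real" audio
```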
The more the choice of speaker embedding technology resembles those used in voice authentication systems, the more problematic the deepfake voices become for those systems.
To help reduce malicious use of your technology, use alternative representations instead of speaker embeddings when developing custom TTS algorithms
The VALL-E series of custom TTS models does not appear to use speaker embeddings, and is among the state of the art in terms of realism and speaker similarity to the human ear. This or similar approaches would therefore be preferable to the GAN-based approach.
2. Speaker similarity is not a helpful metric; in fact it should be minimized
Way back in Key Idea 3, we showed how a modern voice authentication system could stop around 80% of deepfake attacks on its own. In other words, deepfakes get past voice authentication only about 20% of the time.
In our very first Key Idea, we also saw that deepfake voices sounded completely true to life, defeating the human ear 90% (or more) of the time. How can something that fools humans 90% of the time manage to fool voice authentication only 20% of the time?
The answer is that while there is overlap between the two, what sounds real/good to the human ear is not necessarily enough to pass voice authentication, as illustrated below.
When synthetic speech (deepfake) technology is developed to improve MOS metrics, it is steered towards convincing the human ear (see the Venn diagram). There may still be overlap with audio of sufficient quality to pass voice authentication, but that would be a byproduct of the technology, not its goal.
Some papers (here, here and here) describing new synthetic speech systems have, however, started using voice authentication (aka speaker similarity) as an objective metric for judging the quality of new text-to-speech (TTS) models, to complement the MOS metrics.
From an AI safety standpoint, we propose avoiding this practice going forward, at least for implementations that are deployed to the public.
Speaker similarity should not be set as a goal for deepfake technology development
The more often deepfake technologies are developed and evaluated using voice authentication, the more adversarial to voice authentication they become. Taken to an extreme, this would be akin to intentionally developing technology for the purpose of fooling voice authentication systems, a goal that could only be described as malicious.
Instead, we propose that speaker similarity be used as a sort of anti-objective when developing synthetic speech technology. We pose a challenge to anyone building such tech: develop a system that maximizes MOS metrics while simultaneously minimizing speaker similarity.
Challenge: speaker similarity should be an anti-objective during deepfake technology development
Referring to the Venn diagram above, this new approach would pull deepfake technology towards the "fools the human ear" region while pushing it away from the "passes voice authentication" region. This would allow lifelike voices to be generated while ensuring that voice authentication remains effective.
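To make the challenge concrete, here is one possible shape such a training objective could take, assuming a differentiable quality proxy (standing in for MOS) and a speaker encoder of the kind used in voice authentication. Both are placeholders, not real library calls, and the weighting factor is arbitrary.

```python
# Sketch of the proposed objective: reward perceptual quality while
# penalizing similarity to the target speaker's embedding.
import torch

def combined_loss(synth_audio: torch.Tensor,          # batch of synthesized audio
                  target_speaker_emb: torch.Tensor,   # (batch, emb_dim) reference embeddings
                  quality_proxy,                      # differentiable stand-in for MOS
                  speaker_encoder,                    # same family as voice authentication
                  lam: float = 0.5) -> torch.Tensor:
    """Loss = -(predicted quality) + lam * (speaker similarity)."""
    quality = quality_proxy(synth_audio).mean()        # higher is better
    synth_emb = speaker_encoder(synth_audio)           # (batch, emb_dim)
    similarity = torch.nn.functional.cosine_similarity(
        synth_emb, target_speaker_emb, dim=-1).mean()  # lower is better
    return -quality + lam * similarity
```

Minimizing this loss pushes the model towards the "fools the human ear" region while actively pushing it away from the "passes voice authentication" region.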
3. Implement a robust watermark for all synthesized speech generated by your technology
The notion of watermarking has spread across all of Generative AI development, and featured prominently at this year's NeurIPS.
The idea is simple: for any asset generated by AI (image, video, or audio), embed a human-imperceptible signature that can later be recovered by a watermark "reader", confirming the AI origin of that asset.
Some recent work on watermarking techniques was featured at this year's Interspeech (here, here and here), but this is still a work in progress. The main challenge of adding watermarks to AI-generated data is making them robust to manipulation (e.g. by a malicious actor). Nevertheless, there has been much progress in watermark technology over the past couple of years, and we are probably close to having something usable, even if not totally robust to tampering.
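To give a feel for the embed-and-read idea (and not for any of the published techniques), here is a toy spread-spectrum-style sketch: a low-amplitude pseudorandom signature, keyed by a secret seed, is added to the waveform and later detected by correlation. A real watermark would be perceptually shaped and designed to survive compression, resampling and deliberate attack, which this one is not.

```python
# Toy spread-spectrum-style watermark, purely to illustrate the embed/read
# idea; NOT robust to compression, resampling or deliberate tampering.
import numpy as np

def embed_watermark(audio: np.ndarray, key: int, strength: float = 0.005) -> np.ndarray:
    """Add a low-amplitude pseudorandom signature keyed by `key`."""
    rng = np.random.default_rng(key)
    signature = rng.choice([-1.0, 1.0], size=audio.shape)
    return audio + strength * signature

def detect_watermark(audio: np.ndarray, key: int, threshold: float = 0.0025) -> bool:
    """Correlate against the keyed signature; high correlation => watermarked."""
    rng = np.random.default_rng(key)
    signature = rng.choice([-1.0, 1.0], size=audio.shape)
    correlation = float(np.mean(audio * signature))
    return correlation > threshold

audio = np.random.default_rng(1).normal(scale=0.1, size=16000)  # 1 s of noise
marked = embed_watermark(audio, key=42)
print(detect_watermark(marked, key=42), detect_watermark(audio, key=42))
```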
That being said, watermarking is only as useful as the extent to which it is applied.
In fact, as long as even one malicious or careless actor exists who develops and releases non-watermarked deepfake technology, then we have to continue developing voice authentication and deepfake technologies that can work in the absence of watermarks.
But adding a watermark to audio deepfakes would be a declaration of good intent by the technologists developing such tools for legitimate purposes.
Anyone developing deepfake technology should implement watermarks within it, to signal that their intent is for non-criminal use
Since watermarks are imperceptible, adding one costs legitimate users nothing. So if you are serious about not having your technology misused by criminals, why not add a watermark to it?
Adversarial approaches can be good for voice authentication research, but don't productize them!
We've now provided three guidelines that, if followed by researchers and technologists creating synthetic speech technology, would help ensure they are doing so safely and responsibly from a voice authentication standpoint. This is, of course, not a comprehensive view.
That aside, those of us building the world's leading voice authentication and deepfake detection systems cannot assume that any of the above will be adhered to. We have to continue developing our technologies with the knowledge that malicious actors will do everything possible to bypass security measures.
Which is why we should break the rules ourselves.
Adversarial technology development can in fact be useful as a way of improving our voice authentication and deepfake detection solutions, and especially as a way to anticipate what malicious actors might try, ensuring our systems remain robust and secure.
Groups developing deepfake technology should partner with those developing voice authentication. Adversarial technology development can be a powerful way to ensure voice authentication is robust to deepfakes, but adversarial systems should not be made public
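As a sketch of what such an internal, never-released red-team loop might look like, the function below measures how often adversarially generated samples slip past a speaker verification check and a deepfake detector. The generate_deepfake, verify_speaker and detect_deepfake callables are placeholders for internal systems, not real APIs.

```python
# Sketch of an internal red-team evaluation loop: generate deepfake attempts,
# measure how often they pass voice authentication and evade the deepfake
# detector, and feed the results back into development.
def red_team_evaluation(target_speakers, generate_deepfake,
                        verify_speaker, detect_deepfake):
    bypassed_auth = bypassed_detector = total = 0
    for speaker in target_speakers:
        fake_audio = generate_deepfake(speaker)      # adversarial generator
        total += 1
        if verify_speaker(fake_audio, speaker):      # passed authentication?
            bypassed_auth += 1
        if not detect_deepfake(fake_audio):          # missed by the detector?
            bypassed_detector += 1
    return {
        "auth_bypass_rate": bypassed_auth / total,
        "detector_miss_rate": bypassed_detector / total,
    }
```

The resulting bypass rates can then drive retraining of the detection and authentication systems, which is the whole point of keeping the adversarial generator in-house.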
If you work on Generative AI technology and/or synthetic speech generation technology, we would love to hear your perspective. What do you think about the guidelines we shared here? Would you adopt these? Why not? Are we missing something? Let us know in the comments below!
tl;dr do not productize deepfake technology that was developed adversarially to voice authentication and/or deepfake detection systems. Only release TTS technology that includes a watermark, to signal yourself as a responsible player in the AI space
The recommendations for generative voice AI practitioners emerged from discussions with Luis Buera and Héctor Delgado, and I also want to thank them for their contributions to this week's article.