Key Idea 6 - build synthetic speech (deepfake) technology responsibly
We started this series by outlining how audio deepfakes have already defeated the human ear. By design.
This is confirmed by the commonly used MOS (Mean Opinion Score) family of metrics, which are calculated from human listener ratings of how lifelike, or how similar to their human counterparts, the deepfake voices sound.
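For readers unfamiliar with how these scores are produced, here is a minimal sketch of a MOS calculation: listeners rate each utterance (typically on a 1-5 scale) and the ratings are averaged. The ratings below are invented purely for illustration.

```python
# Minimal sketch of how a MOS (Mean Opinion Score) is computed from
# listener ratings; the ratings below are made-up illustration values.
import statistics

# Hypothetical 1-5 naturalness ratings from human listeners for one
# synthesized utterance.
ratings = [4, 5, 4, 3, 5, 4, 4, 5]

mos = statistics.mean(ratings)                                  # the MOS itself
ci95 = 1.96 * statistics.stdev(ratings) / len(ratings) ** 0.5   # rough 95% CI

print(f"MOS = {mos:.2f} +/- {ci95:.2f}")
```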
AI safety and responsibility
This theme was present in a big way at this year's NeurIPS, and I took away two major lessons from what I saw:
So perhaps it's more important for each AI practitioner to adopt AI safety and responsibility as a personal philosophy for their own work. I think this would lead to much better outcomes than mechanically going through a checklist of "AI safety" items. I won't make any broad statements or pontificate here; rather, I'd like to open a conversation with anyone who develops generative technologies, particularly in the voice space, where the goal is to artificially clone a specific person's voice.
That being said, we have three recommendations for researchers and technologists who create and deploy synthetic speech technology for legitimate use cases, and who would not want their technology to be used for malicious purposes.
1. Don't use speaker embeddings as part of synthetic speech technology
It's one thing to create a synthetic voice that sounds lifelike, but there's an added dimension of complexity when you want that synthetic voice to sound like a specific person. Whether it's for custom text-to-speech (TTS) or voice conversion technology, the speaker embedding for the target person is used as part of the audio deepfake creation process. This is illustrated in the next two images.
Put simply, a speaker embedding is a compact representation of a person's voice, and is the same type of representation used by voice authentication systems.
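As a very simplified illustration of how such an embedding is used by a voice authentication system, here is a sketch that compares a stored enrolment embedding against the embedding of the incoming audio using cosine similarity. The extract_embedding call mentioned in the comments is a placeholder rather than a real API, and the threshold and random vectors are purely illustrative.

```python
# Sketch of how a voice authentication system compares speaker embeddings.
# extract_embedding() is a placeholder for whatever speaker encoder is used;
# it is not a real library call.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two fixed-length speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# In practice these would come from a speaker encoder applied to audio,
# e.g. emb = extract_embedding("enrolment.wav"); here we use random vectors.
rng = np.random.default_rng(0)
enrolled_embedding = rng.normal(size=256)   # stored at enrolment time
probe_embedding = rng.normal(size=256)      # computed from the incoming call

THRESHOLD = 0.7  # illustrative decision threshold, tuned per system
score = cosine_similarity(enrolled_embedding, probe_embedding)
print("accept" if score >= THRESHOLD else "reject", f"(score={score:.2f})")
```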
Taking this further, several such systems are developed adversarially to voice authentication, using a neural network architecture known as a Generative Adversarial Network (GAN). The GAN concept involves two opposing, or "adversarial", networks: one (the generator) tries to create the most realistic synthetic audio possible, while the other (the discriminator) tries to tell real audio apart from synthetic audio. The two networks are trained in alternation until the feedback loop stabilizes, at which point you "throw away" the discriminator and keep only the generator.
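For readers who have not seen a GAN before, the sketch below shows the alternating generator/discriminator updates in their most stripped-down form (using PyTorch). It is not how any real voice-conversion system is built; real systems condition on speech features, speaker representations and much more, and the sizes here are toy values.

```python
# Highly simplified GAN training loop, only to illustrate the generator /
# discriminator interplay described above.
import torch
import torch.nn as nn

latent_dim, audio_dim = 64, 1024  # toy sizes, not from any real system

generator = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                          nn.Linear(256, audio_dim), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(audio_dim, 256), nn.LeakyReLU(0.2),
                              nn.Linear(256, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def training_step(real_audio: torch.Tensor) -> None:
    batch = real_audio.size(0)
    z = torch.randn(batch, latent_dim)

    # 1) Discriminator tries to separate real audio from generated audio.
    fake_audio = generator(z).detach()
    d_loss = (bce(discriminator(real_audio), torch.ones(batch, 1)) +
              bce(discriminator(fake_audio), torch.zeros(batch, 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # 2) Generator tries to make the discriminator label its output as real.
    g_loss = bce(discriminator(generator(z)), torch.ones(batch, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

# After training stabilizes, only `generator` is kept.
training_step(torch.randn(8, audio_dim))  # one step on dummy "real" audio
```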
The more the choice of speaker embedding technology resembles those used in voice authentication systems, the more problematic the deepfake voices become for those systems.
To help reduce malicious use of your technology, use alternative representations instead of speaker embeddings when developing custom TTS algorithms
The VALL-E series of custom TTS models does not appear to use speaker embeddings, and is among the state of the art in terms of realism and speaker similarity to the human ear. This or similar approaches would therefore be preferable to the GAN-based approach.
2. Speaker similarity is not a helpful metric; in fact it should be minimized
Way back in Key Idea 3, we showed how a modern voice authentication system could stop around 80% of deepfake attacks on its own. In other words, deepfakes get past voice authentication only about 20% of the time.
In our very first Key Idea, we also saw that deepfake voices sounded completely true to life, defeating the human ear 90% (or more) of the time. How can something that fools humans 90% of the time manage to fool voice authentication only 20% of the time?
The answer is that while there is overlap between the two, what sounds real/good to the human ear is not necessarily enough to pass voice authentication, as illustrated below.
When synthetic speech (deepfake) technology is developed to improve MOS metrics, it is steered towards convincing the human ear (see the Venn diagram). There may still be overlap with audio of sufficient quality to pass voice authentication, but that would be a byproduct of the technology, not its goal.
Some papers (here, here and here) describing new synthetic speech systems have, however, started using voice authentication (aka speaker similarity) as an objective metric for judging the quality of new text-to-speech (TTS) models, to complement the MOS metrics.
From an AI safety standpoint, we propose avoiding this practice going forward, at least for implementations that are deployed to the public.
Speaker similarity should not be set as a goal for deepfake technology development
The more often deepfake technologies are developed and evaluated using voice authentication, the more adversarial to voice authentication they become. Taken to an extreme, this would be akin to intentionally developing technology for the purpose of fooling voice authentication systems, a goal that could only be described as malicious.
Instead, we propose that speaker similarity be used as a sort of anti-objective when developing synthetic speech technology. We pose a challenge to anyone building such tech: develop a system that maximizes MOS metrics while simultaneously minimizing speaker similarity.
Challenge: speaker similarity should be an anti-objective during deepfake technology development
Referring to the Venn diagram above, this new approach would pull deepfake technology towards the "fools the human ear" region while pushing it away from the "passes voice authentication" region. This would allow lifelike voices to be generated while ensuring that voice authentication remains effective.
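To make the challenge concrete, here is one possible shape such a training objective could take, assuming a differentiable quality proxy (standing in for MOS) and a speaker encoder of the kind used in voice authentication. Both are placeholders, not real library calls, and the weighting factor is arbitrary.

```python
# Sketch of the proposed objective: reward perceptual quality while
# penalizing similarity to the target speaker's embedding.
import torch

def combined_loss(synth_audio: torch.Tensor,          # batch of synthesized audio
                  target_speaker_emb: torch.Tensor,   # (batch, emb_dim) reference embeddings
                  quality_proxy,                      # differentiable stand-in for MOS
                  speaker_encoder,                    # same family as voice authentication
                  lam: float = 0.5) -> torch.Tensor:
    """Loss = -(predicted quality) + lam * (speaker similarity)."""
    quality = quality_proxy(synth_audio).mean()        # higher is better
    synth_emb = speaker_encoder(synth_audio)           # (batch, emb_dim)
    similarity = torch.nn.functional.cosine_similarity(
        synth_emb, target_speaker_emb, dim=-1).mean()  # lower is better
    return -quality + lam * similarity
```

Minimizing this loss pushes the model towards the "fools the human ear" region while actively pushing it away from the "passes voice authentication" region.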
3. Implement a robust watermark for all synthesized speech generated by your technology
The notion of watermarking has spread across all of Generative AI development, and featured prominently at this year's NeurIPS.
The idea is simple: for any asset generated by AI (image, video, or audio), embed a human-imperceptible signature that can later be recovered by a watermark "reader", confirming the AI origin of that asset.
Some recent work on watermarking techniques was featured at this year's Interspeech (here, here and here), but this is still a work in progress. The main challenge of adding watermarks to AI-generated data is making them robust to manipulation (e.g. by a malicious actor). Nevertheless, there has been much progress in watermark technology over the past couple of years, and we are probably close to having something usable, even if not totally robust to tampering.
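To give a feel for the embed-and-read idea (and not for any of the published techniques), here is a toy spread-spectrum-style sketch: a low-amplitude pseudorandom signature, keyed by a secret seed, is added to the waveform and later detected by correlation. A real watermark would be perceptually shaped and designed to survive compression, resampling and deliberate attack, which this one is not.

```python
# Toy spread-spectrum-style watermark, purely to illustrate the embed/read
# idea; NOT robust to compression, resampling or deliberate tampering.
import numpy as np

def embed_watermark(audio: np.ndarray, key: int, strength: float = 0.005) -> np.ndarray:
    """Add a low-amplitude pseudorandom signature keyed by `key`."""
    rng = np.random.default_rng(key)
    signature = rng.choice([-1.0, 1.0], size=audio.shape)
    return audio + strength * signature

def detect_watermark(audio: np.ndarray, key: int, threshold: float = 0.0025) -> bool:
    """Correlate against the keyed signature; high correlation => watermarked."""
    rng = np.random.default_rng(key)
    signature = rng.choice([-1.0, 1.0], size=audio.shape)
    correlation = float(np.mean(audio * signature))
    return correlation > threshold

audio = np.random.default_rng(1).normal(scale=0.1, size=16000)  # 1 s of noise
marked = embed_watermark(audio, key=42)
print(detect_watermark(marked, key=42), detect_watermark(audio, key=42))
```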
That being said, watermarking is only as useful as the extent to which it is applied.
In fact, as long as even one malicious or careless actor exists who develops and releases non-watermarked deepfake technology, then we have to continue developing voice authentication and deepfake technologies that can work in the absence of watermarks.
But adding a watermark to audio deepfakes would be a declaration of good intent by the technologists developing such tools for legitimate purposes.
Anyone developing deepfake technology should implement watermarks within it, to signal that their intent is for non-criminal use
Since watermarks are imperceptible, adding one costs legitimate users nothing. So if you are serious about not having your technology misused by criminals, why not add a watermark to it?
Adversarial approaches can be good for voice authentication research, but don't productize them!
We've now provided three guidelines that, if followed by researchers and technologists creating synthetic speech technology, would help ensure they are doing so safely and responsibly from a voice authentication standpoint. This is, of course, not a comprehensive view.
That aside, those of us building the world's leading voice authentication and deepfake detection systems cannot assume that any of the above will be adhered to. We have to continue developing our technologies with the knowledge that malicious actors will do everything possible to bypass security measures.
Which is why we should break the rules ourselves.
Adversarial technology development can in fact be useful as a way of improving our voice authentication and deepfake detection solutions, and especially as a way to anticipate what malicious actors might try, ensuring our systems remain robust and secure.
Groups developing deepfake technology should partner with those developing voice authentication. Adversarial technology development can be a powerful way to ensure voice authentication is robust to deepfakes, but adversarial systems should not be made public
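As a sketch of what such an internal, never-released red-team loop might look like, the function below measures how often adversarially generated samples slip past a speaker verification check and a deepfake detector. The generate_deepfake, verify_speaker and detect_deepfake callables are placeholders for internal systems, not real APIs.

```python
# Sketch of an internal red-team evaluation loop: generate deepfake attempts,
# measure how often they pass voice authentication and evade the deepfake
# detector, and feed the results back into development.
def red_team_evaluation(target_speakers, generate_deepfake,
                        verify_speaker, detect_deepfake):
    bypassed_auth = bypassed_detector = total = 0
    for speaker in target_speakers:
        fake_audio = generate_deepfake(speaker)      # adversarial generator
        total += 1
        if verify_speaker(fake_audio, speaker):      # passed authentication?
            bypassed_auth += 1
        if not detect_deepfake(fake_audio):          # missed by the detector?
            bypassed_detector += 1
    return {
        "auth_bypass_rate": bypassed_auth / total,
        "detector_miss_rate": bypassed_detector / total,
    }
```

The resulting bypass rates can then drive retraining of the detection and authentication systems, which is the whole point of keeping the adversarial generator in-house.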
If you work on Generative AI technology and/or synthetic speech generation technology, we would love to hear your perspective. What do you think about the guidelines we shared here? Would you adopt these? Why not? Are we missing something? Let us know in the comments below!
tl;dr do not productize deepfake technology that was developed adversarially to voice authentication and/or deepfake detection systems. Only release TTS technology that includes a watermark, to signal yourself as a responsible player in the AI space
The recommendations for generative voice AI practitioners emerged from discussions with Luis Buera and Héctor Delgado, and I also want to thank them for their contributions to this week's article.