July 2024 | This Month in Generative AI: Moving Through the Uncanny Valley (Pt. 2 of 2)
Content Authenticity Initiative
Authentic storytelling through open standards and Content Credentials.
By Hany Farid, UC Berkeley Professor, CAI Advisor
News and trends shaping our understanding of generative AI technology and its applications.
Last month I discussed how AI-generated images are passing through the uncanny valley. This month I'll discuss AI-generated voices and where they are in their journey from the creepy, robot-like voices of a few years ago to today's more realistic outputs.
A prototypical text-to-speech system consists of two basic parts. First, the input text is converted into a phonetic and prosodic representation that captures the specific sounds, intonation, stress, and rhythm to be spoken. Second, a synthesis engine converts this symbolic representation into a raw audio waveform, typically through an intermediate frequency-based representation.
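The two-stage pipeline above can be sketched in miniature. The snippet below is a deliberately toy illustration, not any production system: real text-to-speech replaces both stages with learned models (an acoustic model plus a neural vocoder), and the lexicon and sine-tone "vocoder" here are made-up placeholders.

```python
import math

SAMPLE_RATE = 16_000

# Hypothetical grapheme-to-phoneme table for illustration only: each
# "phoneme" carries a base pitch (Hz) and a duration (seconds) -- a
# crude stand-in for the phonetic and prosodic representation.
TOY_LEXICON = {
    "a": (220.0, 0.12), "e": (247.0, 0.10), "i": (262.0, 0.10),
    "o": (196.0, 0.12), "u": (175.0, 0.12),
}

def text_to_symbols(text):
    """Stage 1: text -> symbolic (pitch, duration) representation."""
    return [TOY_LEXICON[c] for c in text.lower() if c in TOY_LEXICON]

def symbols_to_waveform(symbols):
    """Stage 2: render each symbol as a sine tone (a toy stand-in for
    a synthesis engine driven by a frequency-based representation)."""
    samples = []
    for pitch, duration in symbols:
        n = round(duration * SAMPLE_RATE)
        samples.extend(
            math.sin(2 * math.pi * pitch * t / SAMPLE_RATE) for t in range(n)
        )
    return samples

waveform = symbols_to_waveform(text_to_symbols("audio"))
```

The point of the separation is that the symbolic middle layer is where prosody lives; swapping in a learned vocoder changes only stage 2.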
Synthesized voices have come a long way. Boosted by advances in machine learning, today's synthetic voices are increasingly realistic. Perhaps most impressive is that voices can be cloned even if they weren't used during the training of the generative-AI system, and with as little as 30 seconds of a voice recording.
In collaboration with Sarah Barrington, a Ph.D. student at UC Berkeley, I have launched a new study to determine just how realistic these AI-generated voices are and whether a cloned voice sounds like the original speaker's voice.
In this study, participants listen to a set of voices (one at a time), half of which are real and half of which are AI-generated. Although we are still collecting data, we have completed a pilot study with 50 participants who each listened to a total of 40 short voice recordings. The average accuracy on this task was 65%, only modestly better than the chance performance of 50%. There was only a small bias: accuracy was 68% for real voices and 62% for fake voices. In other words, participants were slightly more likely to say a recording was real. You can test yourself on a set of 16 voices to see how you do.
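The arithmetic behind these numbers is worth making explicit: with a balanced set of real and fake recordings, overall accuracy is the mean of the two per-class accuracies, and the gap between them translates into a response bias toward answering "real."

```python
# Per-class accuracies reported in the pilot study above.
acc_real = 0.68   # fraction of real voices correctly labeled real
acc_fake = 0.62   # fraction of AI-generated voices correctly labeled fake

# With half the trials real and half fake, overall accuracy is the
# simple average of the per-class accuracies.
overall = (acc_real + acc_fake) / 2  # 0.65

# Fraction of all trials answered "real": hits on real trials plus
# misses on fake trials. Values above 0.5 indicate a bias toward "real".
rate_said_real = (acc_real + (1 - acc_fake)) / 2  # 0.53
```

So participants answered "real" on about 53% of trials, which is the small bias described above.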
We also asked participants how they thought they were distinguishing the real from the fake. We received some interesting insights, including:
While these preliminary results suggest that AI-generated voices are passing through the uncanny valley, they do not mean that all AI-generated voices are indistinguishable from reality. The snippets of voices that participants heard were relatively short, between 3 and 10 seconds, and did not feature yelling, laughing, or anything that reflected strong emotions. If, however, generative AI continues along its current trajectory, it seems likely that sooner or later it is going to be very difficult to perceptually distinguish the real from the fake.
At the same time, AI-generated videos are still on the other side of the uncanny valley. For example, Runway ML recently dropped Gen-3 Alpha, their latest text-to-video generation model. Although at first glance the videos are pretty impressive and some of the short-term temporal consistency problems have been eliminated, there are problems with longer-term temporal consistency. For example, over a 10-second clip, the body shape of the man in pink changes dramatically (and sunglasses magically appear), and on the right the woman's race changes in the span of a few seconds midway through the video.
As with AI-generated images, creators can add Content Credentials to AI-generated audio files to make them easier to identify. One popular voice cloning service, Respeecher, has already implemented Content Credentials to help mitigate the weaponization of AI-generated voices. Other popular services like ElevenLabs offer classifiers that can determine whether a recording was created by their generative engines. And, of course, we and the broader digital forensic research community continue to develop the next generation of forensic tools for automatically detecting AI-generated voices.
The combination of credential-based and forensic-based solutions promises to mitigate the threats posed by generative AI. But since these solutions can’t eliminate the threats, consumer awareness and vigilance remain critical.
Consider joining the movement to restore trust and transparency online.