Can you spot deepfake audio?
Can you tell the difference between a human voice and an AI one? It’s not as easy as you might think. It only takes “about a minute to two minutes of a person’s voice” to create a convincing synthetic version, according to computer science professor Hany Farid.
While text-to-speech and speech-to-speech innovations have widespread potential to improve our work and lives, they’ve also sparked concerns about voice cloning and robocall scams, as well as fears of campaign misinformation and malicious use in an election year.
Generative AI is getting remarkably good at replicating human speech, but there are still a few telltale signs to help you recognize whether you’re listening to an AI voice or a real person.
1. Flat speaking tone
Emotion and sentiment are especially difficult to get right in AI-generated audio. If a voice sounds awkwardly flat without pauses for breathing—or phrases don’t match up with the emotional delivery, like a sentence ending with an upward lilt that implies a question that isn’t there—that’s a potential sign of deepfake audio.
2. Slurred, unnatural speech
Deepfake audio is created by training a natural language processing model on sample recordings of a person’s speech. It’s like an extremely complex form of pattern matching: The more samples you use, the more closely an AI voice will resemble the person it’s meant to mimic. But this also means these models can struggle with unusual or unique words that don’t appear in the samples. Slurred speech, mispronounced words, and awkward stumbling over phrases suggest you might be listening to AI-generated speech.
3. Odd background noises
It’s never been easier to record clean and crisp audio, even if you’re recording from your phone. If you notice a lot of atypical background noise, like static or crackling noises, that’s a clue that it might be deepfake audio—especially if the speaker is somebody who would typically use professional recording equipment, like a creator or celebrity.
4. The ‘too good to be true’ rule
Above all, remember this: Even the best AI detection software can fail to tell the difference between real speech and synthetic audio. That’s why the Better Business Bureau recommends following a basic rule: “If it sounds too good to be true, it probably is.”
What’s Trending in Speech Tech?
3 Big Takeaways from the NAB Show
At this year’s NAB Show in Las Vegas, nothing was a bigger conversation topic than AI. After dozens of insightful conversations on the showroom floor, Rev CEO Jason Chicola came back to Rev’s Austin HQ with three takeaways about what’s exciting video distributors about AI.
1. AI captions are unlocking global reach
High-quality AI captions and subtitles are a game-changer for reaching global audiences, particularly for video distributors who want to translate content into multiple languages. Rather than choosing between expensive translations or nothing at all, AI is helping them expand their audiences efficiently.
2. Audiences are ready for AI dubbing
Without taking a side on the “subs vs. dubs” debate, it’s a fact that some audiences prefer dubbing. Based on what we heard at NAB Show, there’s a growing interest in using AI dubs to give those audiences what they want, particularly by ingesting subtitles into AI dubbing tools.
3. AI summaries can bulk up metadata
As more streaming services invest in their FAST (Free Ad-Supported TV) tiers, the metadata within a content catalog has grown even more valuable, both for targeting ads effectively and personalizing what audiences are watching. The problem? Most metadata collection is inconsistent. By leveraging AI summarization tools to generate an abstract, content keywords, and a cast list, video distributors can improve and accelerate their metadata collection.
Quote of the Month
“Language is a way of breaking open the world.”
Who said it? (Hint: He’s a famous novelist.)
What’s your biggest question about speech tech?
Let us know by leaving a comment below—and we might even include it in an upcoming newsletter.