AI voice generation is here. And I use it every day.
Last year, I contributed to the CBS News 60 Minutes story on “Synthetic Media: How deepfakes could soon change our world.”
The producers explain how “synthetic media, better known as deepfakes, could be a goldmine for filmmakers.”
Synthetic media - hyper-realistic video and audio created by code rather than cameras or microphones - is incredibly relevant to today's business professionals and content creators. Every LinkedIn post you write could be read in your own voice, presented by your AI avatar, and infinitely customized, translated, and personalized to your audience.
We are all creators, storytellers, and communicators.
The 60 Minutes piece does a great job of exploring how Synthesia and Metaphysic.ai approach synthetic video and visual effects. But audio usually gets only a sliver of the budget and attention - even though sound is half of the viewing experience - so let’s spend some additional time on synthetic voice.
More specifically, let's discuss the state of the art in AI-generated voice: Descript's Overdub.
What is Overdub?
Overdub allows anyone to create a model of their voice from as little as a few minutes of training audio. You can craft entirely new sentences, or even correct existing recordings, simply by typing the text you want spoken in your voice.
Where did Overdub come from?
Voice cloning grew out of text-to-speech, an active area of research since the 1980s. The earliest voice interfaces, like Apple’s Siri, required hours and hours of recordings and painstaking editing to create. In 2017, Lyrebird AI pioneered the first platform that let anyone clone their own voice using a simple web interface and a few minutes of audio. It made a major splash in the industry: nothing before had achieved such high quality with so little effort. Descript acquired Lyrebird in 2019, and the founders - Kundan Kumar, Jose Sotelo, and Alexandre de Brébisson - along with the original team and technology became Descript's research team. The Lyrebird technology is now commercialized as Overdub.
How does Overdub work?
Overdub works in two stages: 1) learning the general patterns of human speech from thousands of speakers, and then 2) learning what makes a specific voice unique from just a few minutes of data. It uses deep learning, specifically generative models such as generative adversarial networks, or “GANs,” to produce hyper-realistic speech as high-resolution audio. And it sounds incredible.
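To build some intuition for that two-stage idea, here is a toy sketch of my own - emphatically not Descript's actual model or architecture. Pretend every word has a shared acoustic "base" and each speaker shifts it by a personal offset (think pitch or timbre). Stage 1 averages over many speakers to learn the shared bases; stage 2 learns a new speaker's offset from just a few samples.

```python
import random

random.seed(42)

# Toy illustration of two-stage voice cloning (my own sketch, NOT
# Descript's model): each "word" has a shared acoustic base value, and
# each speaker shifts it by a personal offset.

WORDS = ["hello", "world", "voice", "clone", "audio"]
true_base = {w: random.uniform(0.0, 10.0) for w in WORDS}

def record(word, speaker_offset):
    """One noisy 'recording' of a word by a given speaker."""
    return true_base[word] + speaker_offset + random.gauss(0, 0.05)

# --- Stage 1: pretrain on hundreds of speakers ---
# Speaker offsets are drawn around zero, so averaging recordings
# across many speakers recovers the shared base for each word.
n_speakers = 500
sums = {w: 0.0 for w in WORDS}
for _ in range(n_speakers):
    offset = random.gauss(0, 1.0)
    for w in WORDS:
        sums[w] += record(w, offset)
base = {w: sums[w] / n_speakers for w in WORDS}

# --- Stage 2: adapt to a brand-new speaker from three samples ---
target_offset = 2.5  # the new speaker's true (unknown) shift
few_samples = [(w, record(w, target_offset)) for w in ["hello", "world", "voice"]]
learned_offset = sum(y - base[w] for w, y in few_samples) / len(few_samples)

# Synthesize a word the new speaker never recorded.
synthesized = base["clone"] + learned_offset
error = abs(synthesized - (true_base["clone"] + target_offset))
print(f"learned offset: {learned_offset:.2f}, synthesis error: {error:.3f}")
```

The real system replaces the averages with deep generative networks and the scalar "offset" with a learned speaker embedding, but the split is the same: heavy lifting happens once across many voices, and your few minutes of audio only have to teach the model what is unique about you.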
What do I do with it?
A lot.
We’re fairly comfortable allowing computers, and AI, to do things that we just don’t want to do in content creation. Fix spelling mistakes, autocomplete a word, transcribe our interviews and production notes, remove backgrounds from images, you get the gist.
Now, think of a voice clone as your stunt double. Let’s say you are a voice actor. Sure, you could read a list of every BMW dealership location in North America. But is that really the best use of your time, talents, and vocal cord endurance? You can ask your voice clone to do just those segments - and concentrate your time and energy on the rest of the performance.
In this podcast, the team from A Million Ads includes this real-world example:
The most common “wow moment” is usually when you are making changes to existing presentations, podcasts, and videos. A voice clone makes correcting your recordings as easy as typing.
Let’s say you made a big mistake - you mispronounced an important name or stated the wrong date. Previously, you would need to go back to a studio, set up all your equipment again, and re-record everything. Or an editor or producer would spend hours hunting for that one word or phrase to splice in from elsewhere in the archives - and it never sounded quite right. This technology lets you simply type your corrections.
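One way to picture the mechanics - and this is my assumption about how text-based editing works in general, not a description of Descript's implementation - is that the transcript is time-aligned to the waveform, so editing a word maps directly to splicing a range of audio samples:

```python
# Toy sketch of text-based audio correction (illustrative assumption,
# not Descript's implementation). The transcript is aligned to the
# waveform as (word, start_sample, end_sample) entries.
transcript = [("the", 0, 10), ("meeting", 10, 30),
              ("is", 30, 40), ("tuesday", 40, 70)]
audio = list(range(70))  # stand-in for 70 audio samples

def replace_word(audio, transcript, word, new_samples):
    """Splice new_samples over the range occupied by `word`,
    shifting the timings of every later word accordingly."""
    for i, (w, start, end) in enumerate(transcript):
        if w != word:
            continue
        delta = len(new_samples) - (end - start)
        new_audio = audio[:start] + new_samples + audio[end:]
        new_tx = (transcript[:i]
                  + [(word, start, end + delta)]
                  + [(w2, s + delta, e + delta)
                     for w2, s, e in transcript[i + 1:]])
        return new_audio, new_tx
    raise ValueError(f"word not found: {word}")

# The voice model generates fresh samples for the corrected word
# (faked here as a list of -1s), then we splice them in.
generated = [-1] * 25
fixed_audio, fixed_tx = replace_word(audio, transcript, "tuesday", generated)
print(len(fixed_audio), fixed_tx[-1])  # 65 samples; "tuesday" now spans 40..65
```

The hard parts a real product must solve - generating replacement audio that matches the speaker's voice, and crossfading the splice points so the edit is inaudible - are exactly what the voice clone contributes.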
Another example is how Eloise Lynton, a producer at Pushkin Industries, describes using Descript for their podcasts and audiobooks. They transcribe dozens of hours of interviews and archive material, then pull the audio highlights by directly selecting the text. Next, they use Malcolm’s Overdub voice, which he shared with their editors, to insert narration for the episode - hearing how it all sounds before anyone even enters the studio. (“It’s kind of crazy to hear,” she said.)
Media production is now entering a phase where if you can dream it, it can happen.
I use my voice clone every day - to narrate product demonstrations, to fix mistakes in my screen recordings, and to capture my thoughts faster than I can set up my microphone.
You don’t need an expensive studio or years of training - you just need a laptop and a keyboard with a backspace key. We now have AI that works with us to identify and correct mistakes in our audio and video, rapidly improve our storytelling, and make our material sound studio-quality.
AI is removing the tedious work that often stands between an idea and its expression, so that creators can focus on telling their stories instead of tinkering with complicated tools.
I’ve seen evolutionary leaps in my career: first the move from editing and splicing tape to recording and editing on PCs, and then PCs becoming so powerful that professionals could work in studio-quality setups at home.
We saw advances in other parts of the production process, too - filming on smartphones, and anyone being able to publish and distribute on social media, YouTube, and podcast hosts. But the process in between - editing, correcting mistakes, smoothing out rough patches, cutting unnecessary material - remained incredibly difficult for anyone but the highly skilled. Until now: AI has stepped in to help us.
This next wave is huge, because everyone now has access to tools the world has never seen before. The next generation of storytellers has never had it so good.