Text to speech is getting quite a bit better

Remember in the movie "2001: A Space Odyssey," where HAL 9000 sings the song "Daisy, Daisy" in a robotic voice as it is being taken apart?

That was a reference to a real event! In 1961, Bell Labs developed software for an IBM 704 computer to sing the same song. This wasn't just a playback of a recording -- the computer used a vocoder to synthesize the vocals, syllable by syllable. The demo made headlines around the world as the first "talking computer" and still lives on in Kubrick's celebrated movie.

Since then, you've probably gotten used to computers being able to speak, at least a little. Robotic voices on call center lines or from home assistants have become pretty commonplace. They're understandable and have gotten gradually better over time. But no one would mistake them for real people.

But now, the same transformer-based generative AI techniques that led to ChatGPT and Midjourney are producing realistic human voices. They can match a human's intonation, emotion and expression. They can speak any language and pronounce uncommon words and names. They can mimic a voice after hearing only a short sample. Perhaps because AI is blowing up everywhere else, these new text-to-speech systems aren't getting much attention.

Some of the new text-to-speech systems include:

Just for fun, I've been using Coqui to convert some long books into audiobooks. (I can't share them, because they're copyrighted.) It's a clunky process right now and takes a few days on my Mac for a decent-sized book, but it works.
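To give a flavor of what that clunky process looks like, here's a minimal sketch of a book-to-audiobook pipeline. It assumes Coqui's Python API (`TTS.api.TTS` and `tts_to_file` from the `TTS` package); the model name, the `chunk_text` helper, and the file layout are my own illustrative choices, not anything Coqui prescribes. Chunking matters because TTS models handle short passages much better than whole chapters.

```python
import re

def chunk_text(text, max_chars=500):
    """Split text into sentence-aligned chunks of at most max_chars characters."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)  # current chunk is full; start a new one
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

def book_to_audiobook(book_path, out_dir):
    # Requires: pip install TTS   (Coqui TTS; downloads the model on first run)
    from TTS.api import TTS
    tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
    with open(book_path, encoding="utf-8") as f:
        text = f.read()
    # Synthesize one WAV file per chunk; a real pipeline would then
    # concatenate them with a tool like ffmpeg.
    for i, chunk in enumerate(chunk_text(text)):
        tts.tts_to_file(text=chunk, file_path=f"{out_dir}/part_{i:04d}.wav")
```

Synthesizing chunk by chunk is also what makes the process slow on a laptop: each call runs a neural model, so a full-length book means thousands of invocations.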

The opportunities to make life better for vision-impaired people, in particular, are tremendous.

What else can we build now that computers can talk almost as well as humans?
