From A to Z: How to Clone Any Voice with AI
The rapidly changing technology of our age no longer surprises us. We are already familiar with methods such as face cloning and voice cloning; this tutorial explains, step by step and without going deep into technical details, how to do voice cloning simply.
In this tutorial, we will clone our own voice and then have it speak texts we have prepared. Admittedly, there are many ways this practice can be abused ethically, and it does not take much creativity to imagine how. Evaluated as a whole, though, the benefits outweigh the risks.
Before we get started, let's take a look at the concepts we may encounter regarding audio analysis and cloning.
1. Dataset
Datasets are indispensable for training our models. Generally speaking, there are two types of voice datasets: single-speaker and multi-speaker.
Single-speaker datasets contain recordings from only one speaker. Models trained on them can synthesize only that speaker's voice.
Multi-speaker datasets contain recordings from multiple speakers. Models trained on them can therefore synthesize the voices of speakers from outside the dataset. In this tutorial, we will use the multi-speaker approach: we will synthesize our own voice using models trained on datasets recorded by other people.
Although audio datasets are not very common, many high-quality datasets on this subject have been published recently.
2. Speaker Encoder
We use a speaker encoder to map a speech sample into a representation space where similar voices lie close to each other and different voices lie far apart. In short, we take a voice sample and produce a fixed-dimensional vector that captures the characteristics of that sample. The significant advantage of this method is that it turns our voice sample into a numerical representation we can train on.
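To make this concrete, here is a minimal sketch of extracting a speaker embedding with the open-source Resemblyzer library. This is only an illustration, not the encoder Tortoise uses internally, and the file path is hypothetical.

from resemblyzer import VoiceEncoder, preprocess_wav

# Load and normalize a short recording of the target speaker (hypothetical path)
wav = preprocess_wav("my_voice_samples/sample_01.wav")

encoder = VoiceEncoder()
embedding = encoder.embed_utterance(wav)  # fixed-dimensional vector of 256 floats

print(embedding.shape)  # (256,)
# Similar voices yield embeddings with high cosine similarity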
3. Synthesizer
The synthesizer, in brief, creates a mel spectrogram from text: we give the model some text, and it converts it into a mel spectrogram, which is then passed on to the vocoder.
4. Melspectrogram
The human ear resolves low-frequency sounds better than high-frequency ones; we hear roughly between 20 Hz and 20,000 Hz. The mel scale warps the frequency axis of a spectrogram to match this perception, grouping frequencies the way people hear them. The mel spectrogram is one of the most widely used feature-extraction techniques for audio.
In a mel spectrogram, the frequency axis is logarithmically scaled to reflect this: the representation keeps more detail at low frequencies and less at high frequencies.
Mel-Frequency Cepstral Coefficients (MFCCs), which follow the same logic as the mel spectrogram, can also be used for feature extraction, as can the chromagram, which groups features according to pitch class.
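As a small illustration, the features mentioned above can be computed with the librosa library (an assumption for this sketch; the file name is hypothetical):

import numpy as np
import librosa

y, sr = librosa.load("my_voice_samples/sample_01.wav", sr=22050)

# Mel spectrogram: power spectrogram mapped onto 80 mel bands, then converted to dB
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
mel_db = librosa.power_to_db(mel, ref=np.max)

# MFCCs: a compact summary derived from the mel spectrogram
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Chromagram: energy grouped by the 12 pitch classes
chroma = librosa.feature.chroma_stft(y=y, sr=sr)

print(mel_db.shape, mfcc.shape, chroma.shape)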
5. Vocoder
In short, the vocoder converts the mel spectrograms produced by the synthesizer back into audio files we can listen to and understand. WaveNet, which came into use in 2016, is one of the most popular vocoders.
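To make the mel-spectrogram-to-waveform step tangible, the sketch below inverts a mel spectrogram with the classical Griffin-Lim algorithm via librosa. This is only a simple stand-in for illustration, not the neural vocoder Tortoise uses, and its quality is far below WaveNet.

import librosa
import soundfile as sf

y, sr = librosa.load("my_voice_samples/sample_01.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)

# Approximate inversion: mel spectrogram -> linear spectrogram -> waveform (Griffin-Lim)
y_rec = librosa.feature.inverse.mel_to_audio(mel, sr=sr)

sf.write("reconstructed.wav", y_rec, sr)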
6. Model Summary
Let's now combine these concepts into a summary. Our model has two inputs: the audio sample and the text to be spoken. Voice cloning models usually consist of three parts:
1. Speaker encoder
2. Synthesizer
3. Vocoder
The speaker encoder produces a speaker embedding, which captures the characteristics of our voice sample. The synthesizer then takes the speaker embedding and the text and creates a mel spectrogram, which is the representation of the final speech. To listen to it, we need the vocoder: it takes the mel spectrogram and turns it into a waveform we can play back.
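The whole pipeline can be summarized in pseudocode. The function names below are hypothetical placeholders, not the Tortoise API; only the data flow between the three components matters here.

def encode_speaker(voice_samples):
    # Speaker encoder: voice samples -> fixed-dimensional speaker embedding
    raise NotImplementedError  # e.g. a pretrained speaker-encoder network

def synthesize_mel(text, speaker_embedding):
    # Synthesizer: (text, speaker embedding) -> mel spectrogram
    raise NotImplementedError  # e.g. a Tacotron-style sequence-to-sequence model

def vocode(mel_spectrogram):
    # Vocoder: mel spectrogram -> waveform
    raise NotImplementedError  # e.g. WaveNet or a simpler Griffin-Lim fallback

def clone_and_speak(voice_samples, text):
    speaker_embedding = encode_speaker(voice_samples)            # step 1
    mel_spectrogram = synthesize_mel(text, speaker_embedding)    # step 2
    return vocode(mel_spectrogram)                               # step 3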
7. Running the Model
Step 1: Speech Generation
There are three different libraries that I can recommend for voice cloning.
Two of them have interfaces, but since I got more successful results with Tortoise, we will use it even though it does not have an interface. Tortoise is a text-to-speech (TTS) program that can mimic a voice given 2-5 examples. It is composed of five separately trained neural networks that are pipelined together to produce the final output. I recommend taking a look at it for a more detailed analysis.
Using Anaconda or Miniconda will make our work easier. Let's clone Tortoise to our local machine with git:
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
python -m pip install -r ./requirements.txt
python setup.py install
conda install -c conda-forge pysoundfile
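Optionally, a quick sanity check confirms that PyTorch (pulled in by the requirements) is usable and whether a GPU is visible; Tortoise also runs on the CPU, just much more slowly.

import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())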
After everything is installed, prepare your own voice data.
Step 2: Prepare Our Voice
To clone our voice, we first need to record it. We need at least three recordings of about 5-10 seconds each; five is the optimal number. While recording, there should be no other sounds such as music in the background; only our voice should be captured. Likewise, a clear and fluent recording will improve the model's results. The simple and free Audacity can be used for recording and for editing the recordings.
The recordings should be saved in WAV format with a sampling rate of 22,050 Hz and 32-bit float encoding. The sampling rate is the number of samples taken per second; 16 kHz is a common value for speech. Higher rates mean better quality, lower rates mean poorer quality.
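The conversion can be scripted; here is a minimal sketch assuming librosa and soundfile are installed (the file names are hypothetical):

import librosa
import soundfile as sf

TARGET_SR = 22050

# librosa resamples to the requested rate and returns 32-bit float samples
y, sr = librosa.load("raw_recording.wav", sr=TARGET_SR)

# Write the result as a WAV file with 32-bit float encoding
sf.write("sample_01.wav", y, TARGET_SR, subtype="FLOAT")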
Let's collect the voice samples that we prepared in a folder and place them in the relevant directory.
\tortoise-tts\tortoise\voices\my_voice_samples
After everything is ready, pass the text to be spoken to the script:
python tortoise/do_tts.py --text "I'm going to speak this" --voice my_voice_samples --preset fast
We only used the 'text', 'voice', and 'preset' parameters; note that 'voice' must match the folder name we created above (passing 'random' would generate a random voice instead of our clone). Different results can be obtained with different parameter values. The main parameters are described below, followed by a sketch of the equivalent Python call.
text: Text to speak. Default value: The expressiveness of autoregressive transformers is literally nuts! I absolutely adore them.
voice: Select the voice to use for generation. Default value: random
preset: Which voice preset to use. Default value: fast. The available presets are:
ultra_fast: Produces speech the fastest.
fast: Decent quality speech at a decent inference rate. A good choice for mass inference.
standard: Very good quality. This is generally about as good as you are going to get.
high_quality: Use if you want the absolute best. This is not worth the compute, though.
candidates: How many output candidates to produce per voice. Default value: 3
output_path: Where to store the outputs. Default value: /results
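For reference, generation can also be driven from Python instead of the command line. The sketch below follows the API shown in the Tortoise README at the time of writing (TextToSpeech, load_voice, tts_with_preset, 24 kHz output); verify the exact names against the version you installed.

import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

tts = TextToSpeech()

# Load the clips we placed under tortoise/voices/my_voice_samples
voice_samples, conditioning_latents = load_voice("my_voice_samples")

gen = tts.tts_with_preset(
    "I'm going to speak this",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset="fast",
)

# Tortoise outputs 24 kHz audio
torchaudio.save("generated.wav", gen.squeeze(0).cpu(), 24000)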
Looking at the results, my accent has vanished in the generated speech, and the output varies slightly between samples, though the margin is quite small. In general the speech is very fluid, and the differences between the generated samples are minor.
This tutorial is for English. For other languages, the model needs to be retrained, which requires a good dataset in the chosen language (at least ~300 hours, high quality, and with transcripts).
Thanks for reading. I hope you learned something. For any issue regarding installation/use or any other related issues, please contact me.