From A to Z: How to Clone Any Voice with AI
The rapidly changing technology of our age no longer surprises us. We are already familiar with methods such as face cloning and voice cloning; this tutorial explains, step by step and without going deep into technical details, how to do voice cloning simply.
In this tutorial, we will clone our own voice and then have it speak texts we have prepared. Admittedly, there are many ways this practice can be abused ethically, and it does not take much creativity to imagine how. Evaluated as a whole, though, the benefits outweigh the risks.
Before we get started, let's take a look at the concepts we may encounter regarding audio analysis and cloning.
1. Dataset
Datasets are indispensable for training our models. Generally speaking, there are two types of voice datasets: single-speaker and multi-speaker.
Single-speaker datasets contain recordings from only one speaker. Models trained on them can synthesize only that speaker's voice.
Multi-speaker datasets contain recordings from multiple speakers. Models trained on them can therefore synthesize the voices of speakers from outside the dataset. In this tutorial, we will use the multi-speaker approach: we will synthesize our own voice using models trained on datasets recorded by other people.
Although audio datasets are not very common, many high-quality datasets on this subject have been published recently.
2. Speaker Encoder
We use a speaker encoder to map a speech sample into a representation space where similar voices lie close to each other and different voices lie far apart. In short, we take a voice sample and produce a fixed-dimensional vector that captures the characteristics of that sample. The significant advantage of this method is that it turns our voice sample into a numerical representation we can train on.
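To make this concrete, here is a minimal sketch of extracting a speaker embedding with the open-source Resemblyzer library. This is only an illustration, not the encoder Tortoise uses internally, and the file path is hypothetical.

from resemblyzer import VoiceEncoder, preprocess_wav

# Load and normalize a short recording of the target speaker (hypothetical path)
wav = preprocess_wav("my_voice_samples/sample_01.wav")

encoder = VoiceEncoder()
embedding = encoder.embed_utterance(wav)  # fixed-dimensional vector of 256 floats

print(embedding.shape)  # (256,)
# Similar voices yield embeddings with high cosine similarity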
3. Synthesizer
The synthesizer, in brief, creates a mel spectrogram from text: we give the model some text, and it converts it into a mel spectrogram, which is then passed on to the vocoder.
4. Melspectrogram
The human ear resolves low-frequency sounds better than high-frequency ones; we hear roughly between 20 Hz and 20,000 Hz. The mel scale warps the frequency axis of a spectrogram to match this perception, grouping frequencies the way people hear them. The mel spectrogram is one of the most widely used feature-extraction techniques for audio.
In a mel spectrogram, the frequency axis is logarithmically scaled to reflect this: the representation keeps more detail at low frequencies and less at high frequencies.
Mel-Frequency Cepstral Coefficients (MFCCs), which follow the same logic as the mel spectrogram, can also be used for feature extraction, as can the chromagram, which groups features according to pitch class.
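As a small illustration, the features mentioned above can be computed with the librosa library (an assumption for this sketch; the file name is hypothetical):

import numpy as np
import librosa

y, sr = librosa.load("my_voice_samples/sample_01.wav", sr=22050)

# Mel spectrogram: power spectrogram mapped onto 80 mel bands, then converted to dB
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
mel_db = librosa.power_to_db(mel, ref=np.max)

# MFCCs: a compact summary derived from the mel spectrogram
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Chromagram: energy grouped by the 12 pitch classes
chroma = librosa.feature.chroma_stft(y=y, sr=sr)

print(mel_db.shape, mfcc.shape, chroma.shape)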
5. Vocoder
In short, the vocoder converts the mel spectrograms produced by the synthesizer back into audio files we can listen to and understand. WaveNet, which came into use in 2016, is one of the most popular vocoders.
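To make the mel-spectrogram-to-waveform step tangible, the sketch below inverts a mel spectrogram with the classical Griffin-Lim algorithm via librosa. This is only a simple stand-in for illustration, not the neural vocoder Tortoise uses, and its quality is far below WaveNet.

import librosa
import soundfile as sf

y, sr = librosa.load("my_voice_samples/sample_01.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)

# Approximate inversion: mel spectrogram -> linear spectrogram -> waveform (Griffin-Lim)
y_rec = librosa.feature.inverse.mel_to_audio(mel, sr=sr)

sf.write("reconstructed.wav", y_rec, sr)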
6. Model Summary
Let's now combine these concepts into a summary. Our model has two inputs: the audio sample and the text to be spoken. Voice cloning models usually consist of three parts:
1. Speaker encoder
2. Synthesizer
3. Vocoder
The speaker encoder produces a speaker embedding, which captures the characteristics of our voice sample. The synthesizer then takes the speaker embedding and the text and creates a mel spectrogram, which is the representation of the final speech. To listen to it, we need the vocoder: it takes the mel spectrogram and turns it into a waveform we can play back.
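The whole pipeline can be summarized in pseudocode. The function names below are hypothetical placeholders, not the Tortoise API; only the data flow between the three components matters here.

def encode_speaker(voice_samples):
    # Speaker encoder: voice samples -> fixed-dimensional speaker embedding
    raise NotImplementedError  # e.g. a pretrained speaker-encoder network

def synthesize_mel(text, speaker_embedding):
    # Synthesizer: (text, speaker embedding) -> mel spectrogram
    raise NotImplementedError  # e.g. a Tacotron-style sequence-to-sequence model

def vocode(mel_spectrogram):
    # Vocoder: mel spectrogram -> waveform
    raise NotImplementedError  # e.g. WaveNet or a simpler Griffin-Lim fallback

def clone_and_speak(voice_samples, text):
    speaker_embedding = encode_speaker(voice_samples)            # step 1
    mel_spectrogram = synthesize_mel(text, speaker_embedding)    # step 2
    return vocode(mel_spectrogram)                               # step 3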
7. Running the Model
Step 1: Speech Generation
There are three different libraries that I can recommend for voice cloning.
Two of them have interfaces, but since I got more successful results with Tortoise, we will use it even though it does not have an interface. Tortoise is a text-to-speech (TTS) program that can mimic a voice given 2-5 examples. It is composed of five separately trained neural networks that are pipelined together to produce the final output. I recommend taking a look at it for a more detailed analysis.
Using Anaconda or Miniconda will make our work easier. Let's clone Tortoise to our local machine with git:
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
python -m pip install -r ./requirements.txt
python setup.py install
conda install -c conda-forge pysoundfile
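Optionally, a quick sanity check confirms that PyTorch (pulled in by the requirements) is usable and whether a GPU is visible; Tortoise also runs on the CPU, just much more slowly.

import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())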
After everything is installed, prepare your own voice data.
Step 2: Prepare Our Voice
To clone our voice, we first need to record it. We need at least three recordings of about 5-10 seconds each; five is the optimal number. While recording, there should be no other sounds such as music in the background; only our voice should be captured. Likewise, a clear and fluent recording will improve the model's results. The simple and free Audacity can be used for recording and for editing the recordings.
The recordings should be saved in WAV format with a sampling rate of 22,050 Hz and 32-bit float encoding. The sampling rate is the number of samples taken per second; 16 kHz is a common value for speech. Higher rates mean better quality, lower rates mean poorer quality.
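The conversion can be scripted; here is a minimal sketch assuming librosa and soundfile are installed (the file names are hypothetical):

import librosa
import soundfile as sf

TARGET_SR = 22050

# librosa resamples to the requested rate and returns 32-bit float samples
y, sr = librosa.load("raw_recording.wav", sr=TARGET_SR)

# Write the result as a WAV file with 32-bit float encoding
sf.write("sample_01.wav", y, TARGET_SR, subtype="FLOAT")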
Let's collect the voice samples that we prepared in a folder and place them in the relevant directory.
\tortoise-tts\tortoise\voices\my_voice_samples
After everything is ready, pass the text to be spoken to the script:
python tortoise/do_tts.py --text "I'm going to speak this" --voice my_voice_samples --preset fast
We only used the 'text', 'voice', and 'preset' parameters; note that 'voice' must match the folder name we created above (passing 'random' would generate a random voice instead of our clone). Different results can be obtained with different parameter values. The main parameters are described below, followed by a sketch of the equivalent Python call.
text: Text to speak. Default value: The expressiveness of autoregressive transformers is literally nuts! I absolutely adore them.
voice: Select the voice to use for generation. Default value: random
preset: Which voice preset to use. Default value: fast. The available presets are:
ultra_fast: Produces speech the fastest.
fast: Decent quality speech at a decent inference rate. A good choice for mass inference.
standard: Very good quality. This is generally about as good as you are going to get.
high_quality: Use if you want the absolute best. This is not worth the compute, though.
candidates: How many output candidates to produce per voice. Default value: 3
output_path: Where to store the outputs. Default value: /results
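For reference, generation can also be driven from Python instead of the command line. The sketch below follows the API shown in the Tortoise README at the time of writing (TextToSpeech, load_voice, tts_with_preset, 24 kHz output); verify the exact names against the version you installed.

import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

tts = TextToSpeech()

# Load the clips we placed under tortoise/voices/my_voice_samples
voice_samples, conditioning_latents = load_voice("my_voice_samples")

gen = tts.tts_with_preset(
    "I'm going to speak this",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset="fast",
)

# Tortoise outputs 24 kHz audio
torchaudio.save("generated.wav", gen.squeeze(0).cpu(), 24000)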
Looking at the results, my accent has vanished in the generated speech, and the output varies slightly between samples, though the margin is quite small. In general the speech is very fluid, and the differences between the generated samples are minor.
This tutorial is for English. For other languages, the model needs to be retrained, which requires a good dataset in the chosen language (at least ~300 hours, high quality, and with transcripts).
Thanks for reading. I hope you learned something. For any issue regarding installation/use or any other related issues, please contact me.