From A to Z: How to Clone Any Sound with AI

The pace of technological change in our age no longer surprises us. We are already familiar with methods such as face cloning and voice cloning; in this tutorial I will explain, step by step, how to do voice cloning simply, without going deep into the technical details.

In this tutorial, we will clone our own voice and then have texts we prepare spoken in that voice. First, a note of caution: there are many ways this practice can be abused, and it doesn't take much creativity to imagine how. Evaluated as a whole, however, the benefits outweigh the risks.

Before we get started, let's take a look at the concepts we may encounter regarding audio analysis and cloning.

1. Dataset

Datasets are indispensable for training our models. Broadly speaking, there are two types of voice datasets: single-speaker and multi-speaker.

Single-speaker datasets contain recordings of only one speaker. Models trained on these datasets can synthesize only that speaker's voice.

Multi-speaker datasets contain recordings from many different speakers. Models trained on them can therefore synthesize voices that are not in the dataset. In this tutorial, we will use the multi-speaker approach: we will synthesize our own voice using models trained on datasets recorded by other people.

Although audio datasets are not very common, many high-quality datasets have been published recently. The most widely used ones are listed below, followed by a short loading example.

  • Librispeech is a corpus of approximately 1,000 hours of 16 kHz read English speech derived from audiobooks from the LibriVox project.
  • Common Voice is Mozilla's initiative to help teach machines how real people speak. It is about 12 GB in size; the spoken text is based on text from many public domain sources such as user-submitted blog posts, old books, movies, and other public speech corpora.
  • Google AudioSet is an expanding ontology of 635 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. Google used human labelers to check for the presence of specific audio classes in 10-second segments of YouTube videos.
  • VoxCeleb contains around 100,000 utterances by 1,251 celebrities, extracted from YouTube videos and spanning a diverse range of accents, professions, and ages.
  • CHiME is a noisy speech recognition challenge dataset (~4 GB in size). It contains real, simulated, and clean voice recordings: the real recordings come from 4 speakers in nearly 9,000 recordings across 4 noisy locations, the simulated recordings are generated by combining speech utterances with multiple environments, and the clean recordings are noise-free.
  • The AudioMNIST dataset consists of 30,000 audio samples of spoken digits (0-9) from 60 different speakers.
  • The DEMoS dataset contains 9,365 emotional and 332 neutral samples produced by 68 native speakers (23 female, 45 male), covering 6 primary emotions (anger, sadness, happiness, fear, surprise, disgust) plus the secondary emotion guilt.
  • The Emotional Voices dataset contains 2,519 speech samples produced by 100 actors from 5 cultures. Using large-scale statistical inference methods, its authors find that prosody can communicate at least 12 distinct kinds of emotion that are preserved across the 2 cultures analyzed.
  • The Flickr Audio Caption dataset consists of 40,000 spoken captions of 8,000 natural images and is 4.2 GB in size.
  • The SEWA dataset has more than 2,000 minutes of audio-visual data from 398 people (201 male and 197 female) across 6 cultures; emotions are characterized using valence and arousal.
  • The Spoken Wikipedia Corpora is 38 GB in size and is available in versions with and without audio.
  • The 2000 HUB5 English Evaluation Transcripts dataset was developed by the Linguistic Data Consortium (LDC) and consists of transcripts of 40 English telephone conversations.
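
As a quick illustration of working with one of these corpora, the sketch below loads a LibriSpeech split with torchaudio. The split name and root directory are arbitrary choices for this example, and the first run downloads a few hundred megabytes.

import torchaudio

# Download the "dev-clean" split of LibriSpeech into ./data (paths chosen for this example).
dataset = torchaudio.datasets.LIBRISPEECH("./data", url="dev-clean", download=True)

# Each item is (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id).
waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset[0]
print(sample_rate, speaker_id, transcript)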

2. Speaker Encoder

We use a speaker encoder to map speech into a representation space where similar voices lie close to each other and different voices lie far apart. In short, we take a voice sample and create a fixed-dimensional vector that represents the characteristics of that sample. The significant advantage of this approach is that it turns our voice sample into a numeric form that models can be trained on.
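
As a minimal sketch of this idea, the snippet below uses the resemblyzer package, which ships the pretrained encoder from the Real-Time-Voice-Cloning project; the file name my_voice.wav is a placeholder.

from resemblyzer import VoiceEncoder, preprocess_wav

# Resample, normalize, and trim silence from the recording.
wav = preprocess_wav("my_voice.wav")

# Embed the utterance into a fixed-dimensional vector (256 values for this encoder).
encoder = VoiceEncoder()
embedding = encoder.embed_utterance(wav)
print(embedding.shape)  # (256,)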

3. Synthesizer

The synthesizer creates a mel spectrogram from text: we give the text to the model, the model converts it into a mel spectrogram, and we then pass that mel spectrogram to the vocoder.

4. Mel Spectrogram

The human ear resolves low-frequency sounds better than high-frequency ones; humans hear roughly between 20 Hz and 20,000 Hz. The mel scale rescales the frequency axis of a spectrogram to reflect this, grouping frequencies the way people perceive them. The mel spectrogram is one of the most widely used techniques for feature extraction from audio.

[Figure: a mel spectrogram, with the frequency axis on a logarithmic (mel) scale]

In the graph above, the y-axis is logarithmically scaled, mirroring how easily we hear low frequencies. As the mel spectrogram shows, it carries more detail at low frequencies and less at high frequencies.

Mel-Frequency Cepstral Coefficients (MFCCs), which follow the same logic as the mel spectrogram, can also be used for feature extraction, as can the chromagram, which extracts features according to pitch classes.
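
Here is a short sketch of extracting both features with librosa; the file name and the parameter values (80 mel bands, 13 MFCCs) are illustrative choices, not requirements of any particular model.

import librosa
import numpy as np

# Load a recording (placeholder file name) at 22,050 Hz.
y, sr = librosa.load("my_voice.wav", sr=22050)

# Mel spectrogram with 80 mel bands, converted to decibels.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
mel_db = librosa.power_to_db(mel, ref=np.max)

# 13 MFCCs, a more compact representation of the same information.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mel_db.shape, mfcc.shape)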

5. Vocoder

In short, the vocoder converts the mel spectrograms produced by the synthesizer back into audio files that we can listen to and understand. WaveNet, which came into use in 2016, is one of the most popular vocoders.
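
To make the synthesizer/vocoder split concrete, here is a hedged sketch that follows NVIDIA's published PyTorch Hub examples for Tacotron 2 (synthesizer) and WaveGlow (vocoder); a CUDA-capable GPU is assumed, and the output file name is a placeholder.

import torch
from scipy.io.wavfile import write

# Pretrained synthesizer (text -> mel) and vocoder (mel -> waveform) from PyTorch Hub.
tacotron2 = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_tacotron2", model_math="fp32")
tacotron2 = tacotron2.to("cuda").eval()

waveglow = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_waveglow", model_math="fp32")
waveglow = waveglow.remove_weightnorm(waveglow).to("cuda").eval()

utils = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_tts_utils")
sequences, lengths = utils.prepare_input_sequence(["Voice cloning is easier than it sounds."])

with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)  # synthesizer: text to mel spectrogram
    audio = waveglow.infer(mel)                      # vocoder: mel spectrogram to waveform

write("tts_output.wav", 22050, audio[0].data.cpu().numpy())

Note that this pair produces a single fixed voice; the cloning pipeline summarized in the next section adds a speaker encoder to condition the synthesizer on a target voice.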

6. Model Summary

Let's now combine the concepts covered so far into a summary. Our model will have 2 inputs: the first is our audio sample, and the second is the text to be spoken. Voice cloning models usually consist of 3 parts:

1. Speaker encoder

2. Synthesizer

3. Vocoder

[Figure: the voice cloning pipeline, speaker encoder → synthesizer → vocoder]

We create a speaker embedding using the speaker encoder; it captures the characteristics of our voice sample. The synthesizer then takes the speaker embedding and the text and creates a mel spectrogram, which is the representation of the final speech. To listen to that speech we need the vocoder: it takes the mel spectrogram and produces a waveform we can play back.
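
Several open-source toolkits wrap these three stages behind a single call. As a hedged illustration (this is not the tool used later in this tutorial), the sketch below uses Coqui TTS's pretrained YourTTS model, which conditions the synthesizer on the speaker embedding of a reference recording; the model name follows Coqui's catalog and the file paths are placeholders.

from TTS.api import TTS

# Pretrained multilingual voice cloning model (downloaded on first use).
tts = TTS("tts_models/multilingual/multi-dataset/your_tts")

# Internally: encode the reference speaker, synthesize a mel spectrogram for the text,
# then vocode it into a waveform written to disk.
tts.tts_to_file(
    text="This sentence is spoken in the cloned voice.",
    speaker_wav="my_voice.wav",   # placeholder reference recording
    language="en",
    file_path="cloned_output.wav",
)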

7. Model Compiling

Step 1: Speech Generation

There are 3 libraries I can recommend for voice cloning, listed below.

https://github.com/CorentinJ/Real-Time-Voice-Cloning

https://github.com/BenAAndrew/Voice-Cloning-App

https://github.com/neonbjb/tortoise-tts

The first two have graphical interfaces. But since I obtained more successful results with the third option, Tortoise, we will use it even though it does not have an interface. Tortoise is a text-to-speech (TTS) program that can mimic a voice given 2-5 examples. It is composed of five separately trained neural networks that are pipelined together to produce the final output. I recommend taking a look at the repository for a more detailed analysis.

Using Anaconda or Miniconda will make our work easier. Let's clone Tortoise locally; using git commands will make the job easier.

git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
python -m pip install -r ./requirements.txt
python setup.py install
conda install -c conda-forge pysoundfile

After everything is installed, prepare your own voice data.

Step 2: Prepare Our Voice

To clone our voice, we first need to record it. We need at least 3 recordings of about 5-10 seconds each; the optimal number is five. While recording, there should be no other sounds such as music in the background, only our voice. Likewise, a clear and fluent recording will increase the performance of the model. Audacity, which is simple and free, can be used for recording and for editing the recordings.

The recordings should be saved in WAV format with a sampling rate of 22,050 Hz and 32-bit float encoding. The sampling rate is the number of samples taken per second; a typical value for speech is 16 kHz. Higher rates mean better quality, lower rates mean poorer quality.
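
A small sketch for converting an existing recording into that format, assuming the librosa and soundfile packages are installed and using placeholder file names; the resulting file can then be placed in the voices folder shown below.

import librosa
import soundfile as sf

# Load the recording and resample it to 22,050 Hz mono.
y, sr = librosa.load("raw_recording.wav", sr=22050, mono=True)

# Write a 32-bit float WAV file, the format described above.
sf.write("sample1.wav", y, 22050, subtype="FLOAT")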


Let's collect the voice samples we prepared into a folder and place it in the relevant directory:

\tortoise-tts\tortoise\voices\my_voice_samples        

After everything is ready, pass the text to the speech script:

python tortoise/do_tts.py --text "I'm going to speak this" --voice random --preset fast        

We only used the 'text', 'voice', and 'preset' parameters. Different results can be obtained with different parameter values; the parameters are listed below, followed by a Python usage sketch.

  • text: The text to speak. Default value: "The expressiveness of autoregressive transformers is literally nuts! I absolutely adore them."
  • voice: Which voice to use for generation. Default value: random.
  • preset: Which quality preset to use. Default value: fast. The available presets are:
      • ultra_fast: Produces speech fastest.
      • fast: Decent quality speech at a decent inference rate. A good choice for mass inference.
      • standard: Very good quality. This is generally about as good as you are going to get.
      • high_quality: Use if you want the absolute best. This is not worth the compute, though.
  • candidates: How many output candidates to produce per voice. Default value: 3.
  • output_path: Where to store the outputs. Default value: /results.
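
Instead of the command-line script, Tortoise can also be driven from Python. Here is a hedged sketch following the usage pattern in the project's README; the voice folder name is the one we created above, and the output file name is a placeholder.

import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

tts = TextToSpeech()

# Load the reference clips from tortoise/voices/my_voice_samples.
voice_samples, conditioning_latents = load_voice("my_voice_samples")

# Generate speech in the cloned voice with the "fast" preset.
gen = tts.tts_with_preset(
    "I'm going to speak this",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset="fast",
)

# Tortoise outputs 24 kHz audio.
torchaudio.save("generated.wav", gen.squeeze(0).cpu(), 24000)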

In my results, my accent has vanished in the generated speech, and the output varies slightly between samples, although the margin is quite small. In general, the speech is very fluid and the differences between the generated samples are very small.

This tutorial is for English. For other languages, we need to retrain the model, and for that we will need a good dataset (at least ~300 hours, high quality, and with transcripts) in the language of our choice.

Thanks for reading. I hope you learned something. For any issue regarding installation, usage, or anything else related, please contact me.
