Paper Note: Make-An-Audio
The model uses the STFT spectrogram as its intermediate feature. A HiFi-GAN vocoder converts the spectrogram to a waveform. Classifier-free guidance (CFG) is used at sampling time.
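As a reminder of what CFG does, here is a minimal sketch of the guidance step. It assumes the common formulation that extrapolates from the unconditional noise prediction toward the text-conditional one; the function name `cfg_combine` and the parameter `guidance_scale` are hypothetical, and the paper may parameterize the scale differently.

```python
import numpy as np

def cfg_combine(eps_cond, eps_uncond, guidance_scale):
    """Blend conditional and unconditional noise predictions (assumed CFG form).

    guidance_scale = 0 gives the unconditional prediction,
    guidance_scale = 1 gives the conditional prediction,
    guidance_scale > 1 pushes further toward the text condition.
    """
    eps_cond = np.asarray(eps_cond, dtype=np.float64)
    eps_uncond = np.asarray(eps_uncond, dtype=np.float64)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy predictions standing in for the denoiser outputs at one step.
eps_c = np.array([1.0, 2.0])
eps_u = np.array([0.0, 1.0])
guided = cfg_combine(eps_c, eps_u, guidance_scale=3.0)
```

During training, the text condition is typically dropped at random so that one network can produce both predictions.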
The STFT spectrogram is converted to a latent z by an audio encoder. A KL loss and a GAN loss are used to optimize the audio encoder. Its training data can be audio without text labels, which leverages self-supervised training.
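To make the KL term concrete, here is a small sketch assuming the encoder outputs a diagonal Gaussian (mean and log-variance) that is regularized toward a standard normal, as in a typical VAE. That parameterization is an assumption, not something stated in this note; the GAN loss is omitted, and the names `kl_to_standard_normal` and `sample_latent` are hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, I) ), summed over the last axis."""
    mu = np.asarray(mu, dtype=np.float64)
    logvar = np.asarray(logvar, dtype=np.float64)
    kl_per_dim = 0.5 * (mu ** 2 + np.exp(logvar) - 1.0 - logvar)
    return kl_per_dim.sum(axis=-1)

def sample_latent(mu, logvar, rng):
    """Reparameterized sample z = mu + sigma * noise."""
    mu = np.asarray(mu, dtype=np.float64)
    std = np.exp(0.5 * np.asarray(logvar, dtype=np.float64))
    return mu + std * rng.standard_normal(mu.shape)

rng = np.random.default_rng(0)
mu = np.zeros((2, 4))        # toy encoder means, batch of 2, latent dim 4
logvar = np.zeros((2, 4))    # toy encoder log-variances
z = sample_latent(mu, logvar, rng)
kl = kl_to_standard_normal(mu, logvar)
```

Because none of these terms needs a caption, the autoencoder can be trained on unlabeled audio alone.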
Because the available data is insufficient, the authors came up with a neat approach: constructing new data by superimposing and splicing different samples.
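The two operations can be sketched on raw waveforms as follows. This is only an illustration under assumptions: the helper names `superimpose` and `splice`, the gain parameter, and the peak normalization are hypothetical choices, and how the paper builds the matching text for combined clips is not covered in this note.

```python
import numpy as np

def superimpose(a, b, gain_b=1.0):
    """Mix two clips by adding them; the shorter one is zero-padded."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    out = np.zeros(max(len(a), len(b)), dtype=np.float64)
    out[:len(a)] += a
    out[:len(b)] += gain_b * b
    # Rescale only if the mix would clip outside [-1, 1].
    peak = np.max(np.abs(out))
    return out / peak if peak > 1.0 else out

def splice(a, b):
    """Play clip a, then clip b, by concatenating along time."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return np.concatenate([a, b])

# Toy clips standing in for two audio samples at the same sample rate.
clip_a = np.array([0.5, -0.5, 0.5, -0.5])
clip_b = np.array([0.8, 0.8])
mixed = superimpose(clip_a, clip_b)
joined = splice(clip_a, clip_b)
```

Both clips are assumed to share a sample rate; real recordings would need resampling first.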