TORCHAUDIO
Dhanushkumar R
Torchaudio is a library for audio and signal processing with PyTorch. It provides I/O, signal and data processing functions, datasets, model implementations and application components.
Loading audio data
To load audio data, you can use torchaudio.load(). This function accepts a path-like object or file-like object as input. The returned value is a tuple of waveform (Tensor) and sample rate (int). By default, the resulting tensor object has dtype=torch.float32 and its value range is [-1.0, 1.0].
import torchaudio

# SAMPLE_WAV is a path to a .wav file on disk
waveform, sample_rate = torchaudio.load(SAMPLE_WAV)
Loading from file-like object
The I/O functions support file-like objects. This allows for fetching and decoding audio data from locations within and beyond the local file system. The following examples illustrate this.
# Load audio data over an HTTP request
import requests
import torchaudio

url = "https://download.pytorch.org/torchaudio/tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav"
with requests.get(url, stream=True) as response:
    waveform, sample_rate = torchaudio.load(response.raw)
Saving to file-like object
Similar to the other I/O functions, you can save audio to file-like objects. When saving to a file-like object, the format argument is required.
import io
import torchaudio

waveform, sample_rate = torchaudio.load(SAMPLE_WAV)

# Saving to a bytes buffer; format is required for file-like objects
buffer_ = io.BytesIO()
torchaudio.save(buffer_, waveform, sample_rate, format="wav")
buffer_.seek(0)
print(buffer_.read(16))
torchaudio.transforms provides various transformations and functions that are useful for processing audio data. Let's dive into some commonly used ones.
Spectrogram:
A spectrogram is a figure that represents the spectrum of frequencies of a recorded audio signal over time.
Brighter regions in the figure indicate that the sound energy is heavily concentrated around those frequencies, while darker regions indicate that the sound is close to silent. This gives us a good understanding of the shape and structure of the audio without even listening to it!
n_fft: Number of FFT components (default: 400).
win_length: Size of the window used for the FFT (default: n_fft).
hop_length: Number of samples between successive frames (default: win_length // 2).
power: Exponent for the magnitude spectrogram (default: 2.0; None keeps the complex STFT).
spectrogram_transform = torchaudio.transforms.Spectrogram()
spectrogram = spectrogram_transform(waveform)
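The parameters above can also be set explicitly. A minimal sketch, where the values simply restate the defaults:

spectrogram_transform = torchaudio.transforms.Spectrogram(
    n_fft=400,        # -> n_fft // 2 + 1 = 201 frequency bins
    win_length=400,   # defaults to n_fft
    hop_length=200,   # defaults to win_length // 2
    power=2.0,        # 2.0 -> power spectrogram; None keeps the complex STFT
)
spectrogram = spectrogram_transform(waveform)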
Mel Spectrogram:
What is MEL SPECTROGRAM?
MelSpectrogram applies a frequency-domain filter bank (a set of mel-spaced triangular filters) to audio signals that are windowed in time.
The Mel Scale is a logarithmic transformation of a signal's frequency. The core idea of this transformation is that sounds of equal distance on the Mel Scale are perceived to be of equal distance by humans.
Why MelSpectrogram?
Because the Mel scale closely mimics human auditory perception, it offers a good representation of the frequency content that humans typically hear. A mel spectrogram is simply a spectrogram (the squared magnitude of the short-time Fourier transform) mapped onto the Mel scale.
mel_spectrogram_transform = torchaudio.transforms.MelSpectrogram()
mel_spectrogram = mel_spectrogram_transform(waveform)
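MelSpectrogram accepts the same STFT parameters plus mel-specific ones. A minimal sketch; sample_rate and n_mels here are illustrative and should match your audio:

mel_spectrogram_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,  # must match the sample rate of the waveform
    n_fft=400,
    hop_length=200,
    n_mels=64,          # number of mel filter banks
)
mel_spectrogram = mel_spectrogram_transform(waveform)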
MFCC (Mel-frequency cepstral coefficients):
What is MFCC?
Mel-Frequency Cepstral Coefficients (MFCC) is the most popular and dominant method to extract spectral features for speech, using a perceptually motivated, mel-spaced filter bank applied to the Fourier-transformed signal. The MFCC technique aims to derive features from the audio signal that can be used for detecting the phones in speech. Since a given audio signal contains many phones, we break it into segments, each 25 ms wide, with a 10 ms shift between successive segments.
On average, a person speaks three words per second with about 4 phones per word, and each phone has three states, resulting in 36 states per second, or roughly 28 ms per state, which is close to our 25 ms window.
From each segment, we will extract 39 features. Moreover, if we chop the signal off directly at the segment edges, the sudden fall in amplitude at the edges will produce noise in the high-frequency domain. So instead of a rectangular window, we use a Hamming/Hanning window to slice the signal, which does not introduce noise in the high-frequency region.
mfcc_transform = torchaudio.transforms.MFCC()  # note: torchaudio's default is n_mfcc=40
mfcc = mfcc_transform(waveform)
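The 25 ms window and 10 ms shift described above can be expressed via melkwargs, which forwards parameters to the underlying MelSpectrogram. A sketch assuming 16 kHz audio (400 and 160 samples respectively):

mfcc_transform = torchaudio.transforms.MFCC(
    sample_rate=16000,
    n_mfcc=13,  # 13 base coefficients; deltas bring the total to 39 (see below)
    melkwargs={
        "n_fft": 400,       # 25 ms window at 16 kHz
        "hop_length": 160,  # 10 ms shift at 16 kHz
        "n_mels": 40,
    },
)
mfcc = mfcc_transform(waveform)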
How many MFCC features are there?
So overall, the MFCC technique generates 39 features from each audio frame, which are used as input for the speech recognition model.
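The usual breakdown is 13 base coefficients plus 13 first-order (delta) and 13 second-order (delta-delta) differences. A sketch stacking them with torchaudio.functional.compute_deltas, continuing from the mfcc tensor above:

import torch
import torchaudio.functional as F

delta = F.compute_deltas(mfcc)    # first-order differences
delta2 = F.compute_deltas(delta)  # second-order differences
features = torch.cat([mfcc, delta, delta2], dim=1)  # 13 + 13 + 13 = 39 features per frame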
What are the advantages of MFCC?
The advantage of MFCC is that it is good at error reduction and able to produce robust features when the signal is affected by noise. SVD/PCA techniques can then be used to extract the most important features from the resulting representation.
What are the disadvantages of MFCC?
The most notable downside of using MFCC is its sensitivity to noise, due to its dependence on the spectral form. Methods that utilize information in the periodicity of speech signals could be used to overcome this problem, although speech also contains aperiodic content.
Resample:
One resampling application is the conversion of digitized audio signals from one sample rate to another, such as from 48 kHz (the digital audio tape standard) to 44.1 kHz (the compact disc standard).
In signal processing, resampling changes the sampling rate of a discrete signal: new sample values are interpolated from the existing ones, with low-pass filtering applied to avoid aliasing when downsampling.
resample_transform = torchaudio.transforms.Resample(orig_freq=44100, new_freq=16000)
resampled_waveform = resample_transform(waveform)
AmplitudeToDB:
AmplitudeToDB converts a spectrogram from the power/amplitude scale to the decibel scale.
amplitude_to_db_transform = torchaudio.transforms.AmplitudeToDB(stype="power")
db_spectrogram = amplitude_to_db_transform(spectrogram)
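The optional top_db argument clips the dynamic range below the peak. A short sketch with an illustrative value of 80 dB:

amplitude_to_db_transform = torchaudio.transforms.AmplitudeToDB(stype="power", top_db=80)
db_spectrogram = amplitude_to_db_transform(spectrogram)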
STREAMREADER
The streaming API leverages the powerful I/O features of ffmpeg.
It can load audio/video in various formats and from various sources, including local and remote locations, file-like objects, and devices such as a microphone or camera; it can also generate synthetic audio/video signals and decode streams chunk by chunk.
Example:
StreamReader(src="sine=sample_rate=8000:frequency=360", format="lavfi")
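A minimal sketch reading that synthetic sine source chunk by chunk; frames_per_chunk here is an illustrative choice:

from torchaudio.io import StreamReader

streamer = StreamReader(src="sine=sample_rate=8000:frequency=360", format="lavfi")
streamer.add_basic_audio_stream(frames_per_chunk=8000)

# stream() yields one tuple of chunks per iteration, one entry per output stream
for i, (chunk,) in enumerate(streamer.stream()):
    print(i, chunk.shape)  # [frames, channels]
    if i == 2:  # the lavfi source is infinite, so stop after a few chunks
        break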
STREAMWRITER
Use torchaudio.io.StreamWriter to encode and save audio/video data into various formats and destinations.
StreamWriter(dst="audio.wav")
StreamWriter(dst="audio.mp3")

# In-memory encoding; format is required when writing to a file-like object
buffer = io.BytesIO()
StreamWriter(dst=buffer, format="wav")
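A minimal sketch writing a waveform to a WAV file, assuming a mono 16 kHz float32 tensor; write_audio_chunk expects a (frames, channels) tensor:

import torch
from torchaudio.io import StreamWriter

sample_rate = 16000
waveform = 0.1 * torch.randn(sample_rate, 1)  # hypothetical 1-second mono signal

writer = StreamWriter(dst="audio.wav")
writer.add_audio_stream(sample_rate=sample_rate, num_channels=1)
with writer.open():
    writer.write_audio_chunk(0, waveform)  # 0 indexes the stream added above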
RESAMPLING:
To resample an audio waveform from one frequency to another, you can use torchaudio.transforms.Resample or torchaudio.functional.resample(). transforms.Resample precomputes and caches the kernel used for resampling, while functional.resample computes it on the fly, so using torchaudio.transforms.Resample will result in a speedup when resampling multiple waveforms with the same parameters.
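A sketch contrasting the two, with a random tensor standing in for real 44.1 kHz audio:

import torch
import torchaudio
import torchaudio.functional as F

waveform = torch.randn(1, 44100)  # hypothetical 1-second signal at 44.1 kHz

# transforms.Resample builds the kernel once; reuse it across many waveforms
resampler = torchaudio.transforms.Resample(orig_freq=44100, new_freq=16000)
resampled = resampler(waveform)

# functional.resample recomputes the kernel on every call
resampled_fn = F.resample(waveform, orig_freq=44100, new_freq=16000)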
CONCLUSION:
This article covered a basic overview of torchaudio: loading and saving audio data, common transforms (Spectrogram, MelSpectrogram, MFCC, Resample, AmplitudeToDB), and the StreamReader and StreamWriter streaming APIs.