TORCHAUDIO

Torchaudio is a library for audio and signal processing with PyTorch. It provides I/O, signal and data processing functions, datasets, model implementations and application components.


Loading audio data

To load audio data, you can use torchaudio.load(). This function accepts a path-like object or a file-like object as input. The returned value is a tuple of waveform (Tensor) and sample rate (int). By default, the resulting tensor has dtype=torch.float32 and its values lie in the range [-1.0, 1.0].


import torchaudio

waveform, sample_rate = torchaudio.load(SAMPLE_WAV)

Loading from file-like object

The I/O functions support file-like objects. This allows for fetching and decoding audio data from locations within and beyond the local file system. The following examples illustrate this.


import requests
import torchaudio

# Load audio data over an HTTP request
url = "https://download.pytorch.org/torchaudio/tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav"
with requests.get(url, stream=True) as response:
    waveform, sample_rate = torchaudio.load(response.raw)

Saving to file-like object

Similar to the other I/O functions, you can save audio to file-like objects. When saving to a file-like object, the format argument is required.


import io

import torchaudio

waveform, sample_rate = torchaudio.load(SAMPLE_WAV)

# Saving to a bytes buffer
buffer_ = io.BytesIO()
torchaudio.save(buffer_, waveform, sample_rate, format="wav")

buffer_.seek(0)
print(buffer_.read(16))

torchaudio also provides various transformations and functions that are useful for processing audio data. Let's dive into some commonly used transformations provided by torchaudio.transforms:

Spectrogram:

A spectrogram is a figure that represents the spectrum of frequencies of a recorded audio signal over time.

Brighter regions in the figure indicate frequencies where the sound energy is heavily concentrated, while darker regions indicate little or no energy. This gives us a good understanding of the shape and structure of the audio without even listening to it!


  • torchaudio.transforms.Spectrogram computes the Short-Time Fourier Transform (STFT) of an audio waveform. The STFT represents the signal in both time and frequency domains.
  • Parameters:

n_fft: Number of FFT components (default: 400).

hop_length: Number of samples between successive frames (default: win_length // 2).

win_length: Size of the window used for the FFT (default: n_fft).

power: Exponent for the magnitude spectrogram (default: 2.0; pass None to get a complex-valued spectrogram).


spectrogram_transform = torchaudio.transforms.Spectrogram()
spectrogram = spectrogram_transform(waveform)
        

Mel Spectrogram:

  • torchaudio.transforms.MelSpectrogram computes the Mel spectrogram of an audio waveform. It maps the STFT magnitude to the mel scale.
  • Parameters: n_fft, hop_length, win_length: similar to Spectrogram; n_mels: number of mel filter banks.

What is a MEL SPECTROGRAM?

MelSpectrogram applies a frequency-domain filter bank (a mel filter bank) to audio signals that are windowed in time, mapping each STFT frame onto perceptually spaced frequency bins.

The Mel scale is a logarithmic transformation of a signal's frequency. The core idea of this transformation is that sounds of equal distance on the Mel scale are perceived to be of equal distance by humans.

Why MelSpectrogram?

Because the Mel scale closely mimics human perception, it offers a good representation of the frequencies that humans typically hear. By comparison, an ordinary power spectrogram is just the squared magnitude spectrum of the audio signal, with frequencies spaced linearly.


mel_spectrogram_transform = torchaudio.transforms.MelSpectrogram()
mel_spectrogram = mel_spectrogram_transform(waveform)

        

MFCC (Mel-frequency cepstral coefficients):

  • torchaudio.transforms.MFCC computes MFCCs from a waveform or spectrogram.
  • Parameters: similar to MelSpectrogram, plus n_mfcc (number of MFCCs to compute).

What is MFCC?

Mel-Frequency Cepstral Coefficients (MFCC) is the most popular and dominant method for extracting spectral features for speech, using perceptually based mel-spaced filter bank processing of the Fourier-transformed signal. The MFCC technique aims to derive features from the audio signal that can be used for detecting the phones in the speech. Since a given audio signal contains many phones, we break it into segments, each 25 ms wide, with a 10 ms shift between successive segments.

On average a person speaks three words per second, with 4 phones per word and three states per phone, resulting in 36 states per second, or roughly 28 ms per state, which is close to our 25 ms window.

From each segment, we extract 39 features. Moreover, when breaking up the signal, if we chop it off abruptly at the segment edges, the sudden fall in amplitude at the edges produces noise in the high-frequency domain. So instead of a rectangular window, we use a Hamming/Hanning window to taper the signal, which avoids introducing that high-frequency noise.


mfcc_transform = torchaudio.transforms.MFCC()
mfcc = mfcc_transform(waveform)

        


How many MFCC features are there?

So overall, the MFCC technique generates 39 features from each frame of the audio signal (typically 13 base coefficients plus their delta and delta-delta coefficients), which are used as input for the speech recognition model.

What are the advantages of MFCC?

The advantage of MFCC is that it is good at error reduction and able to produce robust features when the signal is affected by noise. SVD/PCA techniques can additionally be used to extract the most important features from this representation.

What are the disadvantages of MFCC?

The most notable downside of using MFCC is its sensitivity to noise, due to its dependence on the spectral form. Methods that utilize information in the periodicity of speech signals can be used to mitigate this problem, although speech also contains aperiodic content.

Resample:

  • torchaudio.transforms.Resample changes the sample rate of an audio waveform.
  • Parameters: orig_freq, new_freq: original and new sample rates.

One resampling application is the conversion of digitized audio signals from one sample rate to another, such as from 48 kHz (the digital audio tape standard) to 44.1 kHz (the compact disc standard).

In signal processing, resampling changes the number of samples used to represent a signal. Downsampling requires low-pass filtering so that frequencies above the new Nyquist limit do not alias into the result; torchaudio's Resample applies a bandlimited interpolation kernel that takes care of this.


resample_transform = torchaudio.transforms.Resample(orig_freq=44100, new_freq=16000)
resampled_waveform = resample_transform(waveform)

        


AmplitudeToDB:

  • torchaudio.transforms.AmplitudeToDB converts an amplitude or power spectrogram to the decibel scale.
  • Parameters: stype: "power" or "magnitude", depending on the scale of the input spectrogram (default: "power"); top_db: minimum negative cut-off in decibels (default: None).


amplitude_to_db_transform = torchaudio.transforms.AmplitudeToDB(stype="power")
db_spectrogram = amplitude_to_db_transform(spectrogram)

        

STREAMREADER

The streaming API leverages the powerful I/O features of FFmpeg.

It can

  • Load audio/video in a variety of formats
  • Load audio/video from a local or remote source
  • Load audio/video from a file-like object
  • Load audio/video from microphone, camera, and screen
  • Generate synthetic audio/video signals
  • Load audio/video chunk by chunk
  • Change the sample rate / frame rate and image size on the fly
  • Apply filters and preprocessing

Example:


from torchaudio.io import StreamReader

StreamReader(src="sine=sample_rate=8000:frequency=360", format="lavfi")

STREAMWRITER

Use torchaudio.io.StreamWriter to encode and save audio/video data in various formats/destinations.


from torchaudio.io import StreamWriter

StreamWriter(dst="audio.wav")
StreamWriter(dst="audio.mp3")


import io

from torchaudio.io import StreamWriter

# In-memory encoding: with a file-like destination, format must be given.
buffer = io.BytesIO()
StreamWriter(dst=buffer, format="wav")

RESAMPLING:

To resample an audio waveform from one frequency to another, you can use torchaudio.transforms.Resample or torchaudio.functional.resample(). transforms.Resample precomputes and caches the kernel used for resampling, while functional.resample computes it on the fly, so using torchaudio.transforms.Resample will result in a speedup when resampling multiple waveforms with the same parameters.

CONCLUSION:

This article covered a basic overview of torchaudio: loading and saving audio, common transforms, and the StreamReader and StreamWriter APIs.
