TORCHAUDIO
Dhanushkumar R
Torchaudio is a library for audio and signal processing with PyTorch. It provides I/O, signal and data processing functions, datasets, model implementations and application components.
Loading audio data
To load audio data, you can use torchaudio.load(). This function accepts a path-like object or file-like object as input. The returned value is a tuple of waveform (Tensor) and sample rate (int). By default, the resulting tensor object has dtype=torch.float32 and its value range is [-1.0, 1.0].
import torchaudio

# SAMPLE_WAV is a path to a .wav file on disk
waveform, sample_rate = torchaudio.load(SAMPLE_WAV)
Loading from file-like object
The I/O functions support file-like objects. This allows for fetching and decoding audio data from locations within and beyond the local file system. The following examples illustrate this.
# Load audio data over an HTTP request
import requests
import torchaudio

url = "https://download.pytorch.org/torchaudio/tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav"
with requests.get(url, stream=True) as response:
    waveform, sample_rate = torchaudio.load(response.raw)
Saving to file-like object
Similar to the other I/O functions, you can save audio to file-like objects. When saving to a file-like object, the format argument is required.
import io
import torchaudio

waveform, sample_rate = torchaudio.load(SAMPLE_WAV)

# Saving to a bytes buffer; format is required for file-like objects
buffer_ = io.BytesIO()
torchaudio.save(buffer_, waveform, sample_rate, format="wav")
buffer_.seek(0)
print(buffer_.read(16))
torchaudio.transforms provides various transformations and functions that are useful for processing audio data. Let's dive into some commonly used ones.
Spectrogram:
A spectrogram is a figure that represents the spectrum of frequencies of a recorded audio signal over time.
Brighter regions in the figure indicate that the sound energy is heavily concentrated around those frequencies, while darker regions indicate that the sound is close to silent. This gives us a good understanding of the shape and structure of the audio without even listening to it!
n_fft: Number of FFT components (default: 400).
win_length: Size of the window used for the FFT (default: n_fft).
hop_length: Number of samples between successive frames (default: win_length // 2).
power: Exponent for the magnitude spectrogram (default: 2.0; None keeps the complex STFT).
spectrogram_transform = torchaudio.transforms.Spectrogram()
spectrogram = spectrogram_transform(waveform)
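The parameters above can also be set explicitly. A minimal sketch, where the values simply restate the defaults:

spectrogram_transform = torchaudio.transforms.Spectrogram(
    n_fft=400,        # -> n_fft // 2 + 1 = 201 frequency bins
    win_length=400,   # defaults to n_fft
    hop_length=200,   # defaults to win_length // 2
    power=2.0,        # 2.0 -> power spectrogram; None keeps the complex STFT
)
spectrogram = spectrogram_transform(waveform)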
Mel Spectrogram:
What is MEL SPECTROGRAM?
MelSpectrogram applies a frequency-domain filter bank (a set of mel-spaced triangular filters) to audio signals that are windowed in time.
The Mel Scale is a logarithmic transformation of a signal's frequency. The core idea of this transformation is that sounds of equal distance on the Mel Scale are perceived to be of equal distance by humans.
Why MelSpectrogram?
Because the Mel scale closely mimics human auditory perception, it offers a good representation of the frequency content that humans typically hear. A mel spectrogram is simply a spectrogram (the squared magnitude of the short-time Fourier transform) mapped onto the Mel scale.
mel_spectrogram_transform = torchaudio.transforms.MelSpectrogram()
mel_spectrogram = mel_spectrogram_transform(waveform)
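MelSpectrogram accepts the same STFT parameters plus mel-specific ones. A minimal sketch; sample_rate and n_mels here are illustrative and should match your audio:

mel_spectrogram_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,  # must match the sample rate of the waveform
    n_fft=400,
    hop_length=200,
    n_mels=64,          # number of mel filter banks
)
mel_spectrogram = mel_spectrogram_transform(waveform)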
MFCC (Mel-frequency cepstral coefficients):
What is MFCC?
Mel-Frequency Cepstral Coefficients (MFCC) is the most popular and dominant method to extract spectral features for speech, using a perceptually motivated, mel-spaced filter bank applied to the Fourier-transformed signal. The MFCC technique aims to derive features from the audio signal that can be used for detecting the phones in speech. Since a given audio signal contains many phones, we break it into segments, each 25 ms wide, with a 10 ms shift between successive segments.
On average, a person speaks three words per second with about 4 phones per word, and each phone has three states, resulting in 36 states per second, or roughly 28 ms per state, which is close to our 25 ms window.
From each segment, we will extract 39 features. Moreover, if we chop the signal off directly at the segment edges, the sudden fall in amplitude at the edges will produce noise in the high-frequency domain. So instead of a rectangular window, we use a Hamming/Hanning window to slice the signal, which does not introduce noise in the high-frequency region.
mfcc_transform = torchaudio.transforms.MFCC()  # note: torchaudio's default is n_mfcc=40
mfcc = mfcc_transform(waveform)
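The 25 ms window and 10 ms shift described above can be expressed via melkwargs, which forwards parameters to the underlying MelSpectrogram. A sketch assuming 16 kHz audio (400 and 160 samples respectively):

mfcc_transform = torchaudio.transforms.MFCC(
    sample_rate=16000,
    n_mfcc=13,  # 13 base coefficients; deltas bring the total to 39 (see below)
    melkwargs={
        "n_fft": 400,       # 25 ms window at 16 kHz
        "hop_length": 160,  # 10 ms shift at 16 kHz
        "n_mels": 40,
    },
)
mfcc = mfcc_transform(waveform)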
How many MFCC features are there?
So overall, the MFCC technique generates 39 features from each audio frame, which are used as input for the speech recognition model.
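The usual breakdown is 13 base coefficients plus 13 first-order (delta) and 13 second-order (delta-delta) differences. A sketch stacking them with torchaudio.functional.compute_deltas, continuing from the mfcc tensor above:

import torch
import torchaudio.functional as F

delta = F.compute_deltas(mfcc)    # first-order differences
delta2 = F.compute_deltas(delta)  # second-order differences
features = torch.cat([mfcc, delta, delta2], dim=1)  # 13 + 13 + 13 = 39 features per frame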
What are the advantages of MFCC?
The advantage of MFCC is that it is good at error reduction and able to produce robust features when the signal is affected by noise. SVD/PCA techniques can then be used to extract the most important features from the resulting representation.
What are the disadvantages of MFCC?
The most notable downside of using MFCC is its sensitivity to noise, due to its dependence on the spectral form. Methods that utilize information in the periodicity of speech signals could be used to overcome this problem, although speech also contains aperiodic content.
Resample:
One resampling application is the conversion of digitized audio signals from one sample rate to another, such as from 48 kHz (the digital audio tape standard) to 44.1 kHz (the compact disc standard).
In signal processing, resampling changes the sampling rate of a discrete signal: new sample values are interpolated from the existing ones, with low-pass filtering applied to avoid aliasing when downsampling.
resample_transform = torchaudio.transforms.Resample(orig_freq=44100, new_freq=16000)
resampled_waveform = resample_transform(waveform)
AmplitudeToDB:
AmplitudeToDB converts a spectrogram from the power/amplitude scale to the decibel scale.
amplitude_to_db_transform = torchaudio.transforms.AmplitudeToDB(stype="power")
db_spectrogram = amplitude_to_db_transform(spectrogram)
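The optional top_db argument clips the dynamic range below the peak. A short sketch with an illustrative value of 80 dB:

amplitude_to_db_transform = torchaudio.transforms.AmplitudeToDB(stype="power", top_db=80)
db_spectrogram = amplitude_to_db_transform(spectrogram)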
STREAMREADER
The streaming API leverages the powerful I/O features of ffmpeg.
It can load audio/video in various formats and from various sources, including local and remote locations, file-like objects, and devices such as a microphone or camera; it can also generate synthetic audio/video signals and decode streams chunk by chunk.
Example:
StreamReader(src="sine=sample_rate=8000:frequency=360", format="lavfi")
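A minimal sketch reading that synthetic sine source chunk by chunk; frames_per_chunk here is an illustrative choice:

from torchaudio.io import StreamReader

streamer = StreamReader(src="sine=sample_rate=8000:frequency=360", format="lavfi")
streamer.add_basic_audio_stream(frames_per_chunk=8000)

# stream() yields one tuple of chunks per iteration, one entry per output stream
for i, (chunk,) in enumerate(streamer.stream()):
    print(i, chunk.shape)  # [frames, channels]
    if i == 2:  # the lavfi source is infinite, so stop after a few chunks
        break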
STREAMWRITER
Use torchaudio.io.StreamWriter to encode and save audio/video data into various formats and destinations.
StreamWriter(dst="audio.wav")
StreamWriter(dst="audio.mp3")

# In-memory encoding; format is required when writing to a file-like object
buffer = io.BytesIO()
StreamWriter(dst=buffer, format="wav")
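A minimal sketch writing a waveform to a WAV file, assuming a mono 16 kHz float32 tensor; write_audio_chunk expects a (frames, channels) tensor:

import torch
from torchaudio.io import StreamWriter

sample_rate = 16000
waveform = 0.1 * torch.randn(sample_rate, 1)  # hypothetical 1-second mono signal

writer = StreamWriter(dst="audio.wav")
writer.add_audio_stream(sample_rate=sample_rate, num_channels=1)
with writer.open():
    writer.write_audio_chunk(0, waveform)  # 0 indexes the stream added above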
RESAMPLING:
To resample an audio waveform from one frequency to another, you can use torchaudio.transforms.Resample or torchaudio.functional.resample(). transforms.Resample precomputes and caches the kernel used for resampling, while functional.resample computes it on the fly, so using torchaudio.transforms.Resample will result in a speedup when resampling multiple waveforms with the same parameters.
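A sketch contrasting the two, with a random tensor standing in for real 44.1 kHz audio:

import torch
import torchaudio
import torchaudio.functional as F

waveform = torch.randn(1, 44100)  # hypothetical 1-second signal at 44.1 kHz

# transforms.Resample builds the kernel once; reuse it across many waveforms
resampler = torchaudio.transforms.Resample(orig_freq=44100, new_freq=16000)
resampled = resampler(waveform)

# functional.resample recomputes the kernel on every call
resampled_fn = F.resample(waveform, orig_freq=44100, new_freq=16000)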
CONCLUSION:
This article covered a basic overview of torchaudio: loading and saving audio data, common transforms (Spectrogram, MelSpectrogram, MFCC, Resample, AmplitudeToDB), and the StreamReader and StreamWriter streaming APIs.