Machine Learning for Audio Domain - The Basics
I've been working with ML in the audio domain, and at first I couldn't understand much, but as I kept reading and studying, I managed to figure out some things.
In this post, I'll try to share some of the basic theory with you: mainly what sound and audio are, and what a spectrogram is.
What's Sound?
Sound is a vibration that propagates as an acoustic wave.
It has some properties:
- Frequency
- Amplitude
- Speed
- Direction
For us, Frequency and Amplitude are the important features.
An important aspect is that sounds are a mixture of their component sinusoidal waves (waves that follow a sine curve) of different frequencies. Each component follows the equation y(t) = A · sin(2πft), where:
- A is amplitude
- f is frequency
- t is time
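Here's a minimal sketch of building such a combined wave with NumPy (the 440 Hz and 880 Hz components and their amplitudes are just example values I picked):

```python
import numpy as np
import matplotlib.pyplot as plt

sr = 44100                         # samples per second
t = np.linspace(0, 1, sr)          # one second of time values

# each component is A * sin(2 * pi * f * t)
wave_440 = 0.5 * np.sin(2 * np.pi * 440 * t)   # A = 0.5, f = 440 Hz
wave_880 = 0.3 * np.sin(2 * np.pi * 880 * t)   # A = 0.3, f = 880 Hz

# the sound is the mixture (sum) of its components
combined = wave_440 + wave_880

plt.plot(t[:500], combined[:500])  # plot a short slice so the shape is visible
plt.xlabel("time (s)")
plt.ylabel("amplitude")
plt.show()
```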
Audio Signals are a representation of sound.
You get this data by taking samples of air pressure over time, at a fixed sample rate. When we say 44.1 kHz, it means we take 44,100 samples per second. This results in a waveform.
When you load a WAV file, you get this waveform, or to be more precise, an array of N int16 numbers per channel (mono vs. stereo).
The code below shows how to load an audio file, get the sample rate it was recorded with, and compute the duration of the audio.
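A minimal sketch with SciPy (the filename dog_bark.wav is just a placeholder):

```python
from scipy.io import wavfile

sample_rate, waveform = wavfile.read("dog_bark.wav")   # path is just a placeholder
duration = waveform.shape[0] / sample_rate              # seconds = samples / samples per second

print(sample_rate)      # e.g. 44100
print(waveform.dtype)   # int16 for a 16-bit PCM file
print(waveform.shape)   # (n_samples,) for mono, (n_samples, 2) for stereo
print(duration)
```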
Many ML models work at a 16 kHz sample rate.
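If your file was recorded at 44.1 kHz, a rough sketch of resampling it to 16 kHz with SciPy could look like this:

```python
from scipy.signal import resample_poly

# 44100 * 160 / 441 = 16000, so up=160 and down=441 take us from 44.1 kHz to 16 kHz
waveform_16k = resample_poly(waveform.astype("float32"), up=160, down=441)
sample_rate_16k = 16000
```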
Spectrograms
One common trick for extracting features from audio, or classifying it, is to convert the waveform into a spectrogram. The spectrogram can be treated as a 2D image and fed to CNN layers that extract features from it; from there, the rest works like an image classification model.
But, wait, what's a spectrogram?
A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies over time.
But how do you go from the time domain (waveform) to the frequency domain?
Here is where one of the most famous equations in math comes to the rescue: the Fourier transform,
X(f) = ∫ x(t) · e^(−i2πft) dt
It converts a signal from the time domain to the frequency domain: given a complicated waveform, it extracts all the frequencies and amplitudes of the sinusoidal components that form it.
The code below can help you visualize and understand it better. I applied the Fourier transformation to the combined wave that was created earlier.
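A sketch of that, using NumPy's FFT on the combined wave from the snippet above:

```python
import numpy as np
import matplotlib.pyplot as plt

spectrum = np.fft.rfft(combined)                    # complex frequency components
freqs = np.fft.rfftfreq(len(combined), d=1 / sr)    # frequency in Hz for each component
magnitude = np.abs(spectrum)

plt.plot(freqs, magnitude)
plt.xlim(0, 1000)             # the 440 Hz and 880 Hz peaks sit in this range
plt.xlabel("frequency (Hz)")
plt.ylabel("magnitude")
plt.show()
```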
Applying the FT to the waveform we loaded earlier yields thousands of frequencies, and those frequencies vary over time as the audio changes (from silence, to a dog barking, back to silence...).
To handle this, it's better to apply the transform sequentially to short parts (windows) of the signal; this windowed version is known as the Short-Time Fourier Transform (STFT). Also, the continuous Fourier transform is too slow for practical use on sampled audio, so in practice its discrete version (the DFT) is used, computed efficiently with the Fast Fourier Transform (FFT) algorithm.
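A rough sketch of the windowed idea, applying an FFT to successive slices of the loaded waveform (the window size and hop are just example values):

```python
import numpy as np

signal = waveform if waveform.ndim == 1 else waveform[:, 0]   # keep one channel if stereo

window_size = 2048   # samples per window (~46 ms at 44.1 kHz)
hop = 512            # how far the window slides between FFTs

frames = []
for start in range(0, len(signal) - window_size, hop):
    window = signal[start:start + window_size] * np.hanning(window_size)
    frames.append(np.abs(np.fft.rfft(window)))   # magnitudes of the frequencies in this window

stft = np.array(frames).T   # rows: frequency bins, columns: time steps
```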
Now we can create the spectrogram. We just need to know how to read it:
- The y-axis is the frequency in Hz
- The x-axis is the time
- The color represents the magnitude or amplitude (the brighter the higher). Usually in decibels (dB)
Doing this by hand is not trivial. Here TensorFlow can help with some methods to create the spectrogram based on the waveform: https://www.tensorflow.org/io/api_docs/python/tfio/experimental/audio/spectrogram
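As a sketch, something similar can be done with the lower-level tf.signal.stft (the linked tfio helper produces a comparable spectrogram in a single call); the frame sizes here are just example values:

```python
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

signal = tf.cast(waveform if waveform.ndim == 1 else waveform[:, 0], tf.float32)

stft = tf.signal.stft(signal, frame_length=2048, frame_step=512)   # windowed FFTs
spectrogram = tf.abs(stft)                                         # keep only the magnitude
spectrogram_db = 20 * np.log10(spectrogram.numpy() + 1e-6)         # convert to decibels

plt.imshow(spectrogram_db.T, aspect="auto", origin="lower")        # y: frequency, x: time
plt.xlabel("time frame")
plt.ylabel("frequency bin")
plt.show()
```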
Summary
This is a very brief overview of how we go from a sound to a spectrogram. I hope this helps you understand some of the techniques that are used for audio analysis.
I used multiple resources but this video is a very important one.
A note on formatting: if you need any of the sample code, just leave a comment. Since this is a crosspost from Twitter, I keep the code in images to make it easier to post, but I know that's not optimal when you need to copy it.