Machine Learning for Audio Domain - The Basics
I've been working with ML in the audio domain, and at first I couldn't understand much, but as I kept reading and studying, I managed to figure out some things.
In this post, I'll try to share some of the basic theory with you: mainly what sound and audio are, and what a spectrogram is.
What's Sound?
Sound is a vibration that propagates as an acoustic wave.
It has some properties:
- Frequency
- Amplitude
- Speed
- Direction
For us, Frequency and Amplitude are the important features.
An important aspect is that sounds are a mixture of their component sinusoidal waves (waves that follow a sine curve) of different frequencies. Each component follows the equation y(t) = A · sin(2πft), where:
- A is amplitude
- f is frequency
- t is time
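Here's a minimal sketch of building such a combined wave with NumPy (the 440 Hz and 880 Hz components and their amplitudes are just example values I picked):

```python
import numpy as np
import matplotlib.pyplot as plt

sr = 44100                         # samples per second
t = np.linspace(0, 1, sr)          # one second of time values

# each component is A * sin(2 * pi * f * t)
wave_440 = 0.5 * np.sin(2 * np.pi * 440 * t)   # A = 0.5, f = 440 Hz
wave_880 = 0.3 * np.sin(2 * np.pi * 880 * t)   # A = 0.3, f = 880 Hz

# the sound is the mixture (sum) of its components
combined = wave_440 + wave_880

plt.plot(t[:500], combined[:500])  # plot a short slice so the shape is visible
plt.xlabel("time (s)")
plt.ylabel("amplitude")
plt.show()
```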
Audio Signals are a representation of sound.
You get this data by taking samples of air pressure over time, at a fixed sample rate. When we say 44.1 kHz, it means we take 44,100 samples per second. This results in a waveform.
When you load a WAV file, you get this waveform, or to be more precise, an array of N int16 numbers per channel (mono vs. stereo).
The code below shows how to load an audio file, get the sample rate it was recorded with, and compute the duration of the audio.
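A minimal sketch with SciPy (the filename dog_bark.wav is just a placeholder):

```python
from scipy.io import wavfile

sample_rate, waveform = wavfile.read("dog_bark.wav")   # path is just a placeholder
duration = waveform.shape[0] / sample_rate              # seconds = samples / samples per second

print(sample_rate)      # e.g. 44100
print(waveform.dtype)   # int16 for a 16-bit PCM file
print(waveform.shape)   # (n_samples,) for mono, (n_samples, 2) for stereo
print(duration)
```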
Many ML models work at a 16 kHz sample rate.
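If your file was recorded at 44.1 kHz, a rough sketch of resampling it to 16 kHz with SciPy could look like this:

```python
from scipy.signal import resample_poly

# 44100 * 160 / 441 = 16000, so up=160 and down=441 take us from 44.1 kHz to 16 kHz
waveform_16k = resample_poly(waveform.astype("float32"), up=160, down=441)
sample_rate_16k = 16000
```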
Spectrograms
One common trick for extracting features from audio, or classifying it, is to convert the waveform into a spectrogram. The spectrogram can be treated as a 2D image and fed to CNN layers that extract features from it; from there, the rest works like an image classification model.
But, wait, what's a spectrogram?
A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies over time.
But how do you go from the time domain (waveform) to the frequency domain?
Here is where one of the most famous equations in math comes to the rescue: the Fourier transform,
X(f) = ∫ x(t) · e^(−i2πft) dt
It converts a signal from the time domain to the frequency domain: given a complicated waveform, it extracts all the frequencies and amplitudes of the sinusoidal components that form it.
The code below can help you visualize and understand it better. I applied the Fourier transformation to the combined wave that was created earlier.
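A sketch of that, using NumPy's FFT on the combined wave from the snippet above:

```python
import numpy as np
import matplotlib.pyplot as plt

spectrum = np.fft.rfft(combined)                    # complex frequency components
freqs = np.fft.rfftfreq(len(combined), d=1 / sr)    # frequency in Hz for each component
magnitude = np.abs(spectrum)

plt.plot(freqs, magnitude)
plt.xlim(0, 1000)             # the 440 Hz and 880 Hz peaks sit in this range
plt.xlabel("frequency (Hz)")
plt.ylabel("magnitude")
plt.show()
```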
Applying the FT to the waveform we loaded earlier yields thousands of frequencies, and those frequencies vary over time as the audio changes (from silence, to a dog barking, back to silence...).
To handle this, it's better to apply the transform sequentially to short parts (windows) of the signal; this windowed version is known as the Short-Time Fourier Transform (STFT). Also, the continuous Fourier transform is too slow for practical use on sampled audio, so in practice its discrete version (the DFT) is used, computed efficiently with the Fast Fourier Transform (FFT) algorithm.
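A rough sketch of the windowed idea, applying an FFT to successive slices of the loaded waveform (the window size and hop are just example values):

```python
import numpy as np

signal = waveform if waveform.ndim == 1 else waveform[:, 0]   # keep one channel if stereo

window_size = 2048   # samples per window (~46 ms at 44.1 kHz)
hop = 512            # how far the window slides between FFTs

frames = []
for start in range(0, len(signal) - window_size, hop):
    window = signal[start:start + window_size] * np.hanning(window_size)
    frames.append(np.abs(np.fft.rfft(window)))   # magnitudes of the frequencies in this window

stft = np.array(frames).T   # rows: frequency bins, columns: time steps
```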
Now we can create the spectrogram. We just need to know how to read it:
- The y-axis is the frequency in Hz
- The x-axis is the time
- The color represents the magnitude or amplitude (the brighter the higher). Usually in decibels (dB)
Doing this by hand is not trivial. Here TensorFlow can help with some methods to create the spectrogram based on the waveform: https://www.tensorflow.org/io/api_docs/python/tfio/experimental/audio/spectrogram
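As a sketch, something similar can be done with the lower-level tf.signal.stft (the linked tfio helper produces a comparable spectrogram in a single call); the frame sizes here are just example values:

```python
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

signal = tf.cast(waveform if waveform.ndim == 1 else waveform[:, 0], tf.float32)

stft = tf.signal.stft(signal, frame_length=2048, frame_step=512)   # windowed FFTs
spectrogram = tf.abs(stft)                                         # keep only the magnitude
spectrogram_db = 20 * np.log10(spectrogram.numpy() + 1e-6)         # convert to decibels

plt.imshow(spectrogram_db.T, aspect="auto", origin="lower")        # y: frequency, x: time
plt.xlabel("time frame")
plt.ylabel("frequency bin")
plt.show()
```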
Summary
This is a very brief overview of how we go from a sound to a spectrogram. I hope this helps you understand some of the techniques that are used for audio analysis.
I used multiple resources but this video is a very important one.
A note on formatting: if you need any of the sample code, just leave a comment. Since this is a crosspost from Twitter, I keep the code in images to make it easier to post, but I know that's not optimal when you need to copy it.