Machine Learning for Audio Domain - The Basics

I've been working with ML in the audio domain. At first I couldn't understand much, but as I kept reading and studying, I managed to figure out some things.

In this post, I'll try to share some of the basic theory with you: mainly what sound is, what audio is, and what a spectrogram is.

What's Sound?

Sound is a vibration that propagates as an acoustic wave.

It has some properties:

  • Frequency
  • Amplitude
  • Speed
  • Direction

For us, Frequency and Amplitude are the important features.

An important aspect is that sounds are a mixture of their component sinusoidal waves (waves that follow a sine curve) of different frequencies. Each component follows the equation below:

y(t) = A · sin(2πft)

where:
  • A is amplitude
  • f is frequency
  • t is time
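
For example, here's a minimal sketch (using NumPy and Matplotlib) that builds two sine waves and adds them into a combined wave, which we'll reuse later. The frequencies (440 Hz and 880 Hz) and amplitudes here are arbitrary example values:

import numpy as np
import matplotlib.pyplot as plt

sample_rate = 16000  # samples per second
t = np.linspace(0, 1, sample_rate, endpoint=False)  # 1 second of time steps

# y(t) = A * sin(2*pi*f*t), for two different components
wave_a = 0.5 * np.sin(2 * np.pi * 440 * t)  # A = 0.5, f = 440 Hz
wave_b = 0.3 * np.sin(2 * np.pi * 880 * t)  # A = 0.3, f = 880 Hz
combined_wave = wave_a + wave_b  # the mixture of the two components

plt.plot(t[:200], combined_wave[:200])  # zoom in on the first 12.5 ms
plt.xlabel("time (s)")
plt.ylabel("amplitude")
plt.show()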

What's Audio?

Audio signals are a representation of sound.

You get this data by taking samples of the air pressure over time. The sample rate is how many samples are taken per second: when we say 44.1 kHz, it means we take 44,100 samples per second. This results in a waveform.

[Image: a waveform plotted as amplitude over time]

When you load a wav file, you get this waveform, or to be more precise, an array of N int16 numbers for each channel (one channel for mono, two for stereo).

The code below shows how to load an audio file and get the sample rate that was used, as well as the duration of the audio.

Keep in mind that many ML models work at a 16 kHz sample rate.

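Here's a minimal sketch using SciPy; the file name "dog.wav" is just a placeholder for whatever wav file you have:

from scipy.io import wavfile

# Returns the sample rate the file was recorded at and the raw samples
# as an int16 array: shape (N,) for mono, (N, 2) for stereo.
sample_rate, waveform = wavfile.read("dog.wav")  # placeholder file name

duration_seconds = waveform.shape[0] / sample_rate
print(f"sample rate: {sample_rate} Hz")
print(f"duration: {duration_seconds:.2f} seconds")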


Spectrograms

One common trick used to extract features from audio or to classify it is to convert the waveform to a spectrogram. The spectrogram can be treated as a 2D image, so you can use CNN layers to extract features from it; from there, the model works like an image classification model.

But, wait, what's a spectrogram?

A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies over time.

[Image: an example spectrogram]

But how do you go from the time domain (waveform) to the frequency domain?

Here is where one of the most famous math equations comes to the rescue:

-> The Fourier Transform

X(f) = ∫ x(t) · e^(−i2πft) dt

It converts a signal from the time domain to the frequency domain. In other words, given a complex waveform, it extracts all the frequencies and amplitudes that form that waveform.

The code below can help you visualize and understand it better: it applies the Fourier transform to the combined wave that was created earlier.

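A minimal sketch, assuming the combined_wave and sample_rate from the sine-wave snippet above:

import numpy as np
import matplotlib.pyplot as plt

# rfft keeps only the non-negative frequencies of a real-valued signal.
spectrum = np.fft.rfft(combined_wave)
freqs = np.fft.rfftfreq(len(combined_wave), d=1 / sample_rate)

# Scale the magnitudes so the peaks match the original sine amplitudes.
magnitude = np.abs(spectrum) * 2 / len(combined_wave)

plt.plot(freqs, magnitude)  # expect peaks at 440 Hz (0.5) and 880 Hz (0.3)
plt.xlabel("frequency (Hz)")
plt.ylabel("amplitude")
plt.xlim(0, 1200)
plt.show()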

Applying the FT to the whole waveform loaded earlier yields thousands of frequencies, and those frequencies vary over time as the audio changes (from silence, to a dog, to silence...).


To solve this, it's better to apply the transform sequentially to short parts (windows) of the signal; this windowed version is known as the Short-Time Fourier Transform (STFT). Also, the continuous FT is quite complex and slow for practical use on sampled data, so a discrete version is used instead: the Discrete Fourier Transform (DFT), which is computed efficiently by the Fast Fourier Transform (FFT) algorithm. The sketch below shows the windowing idea.

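Here's a toy sketch of that windowing idea, assuming the mono waveform loaded earlier; the frame and stride sizes are arbitrary example values:

import numpy as np

# Slice the signal into overlapping windows and take one FFT per window.
frame_size, stride = 512, 256
window = np.hanning(frame_size)  # taper each frame to reduce spectral leakage

frames = [waveform[start:start + frame_size] * window
          for start in range(0, len(waveform) - frame_size + 1, stride)]
spectra = np.abs(np.fft.rfft(frames, axis=-1))

print(spectra.shape)  # (number of windows, frame_size // 2 + 1)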

Now we can create the spectrogram. We just need:

  • The y-axis is the frequency in Hz
  • The x-axis is the time
  • The color represents the magnitude or amplitude (the brighter, the higher), usually in decibels (dB)

Doing this by hand is not trivial. Here, TensorFlow I/O can help, with methods that create the spectrogram from the waveform: https://www.tensorflow.org/io/api_docs/python/tfio/experimental/audio/spectrogram

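Here's a sketch, assuming the mono int16 waveform loaded earlier; the nfft, window, and stride values are typical examples, not the only valid choices:

import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_io as tfio

# tfio expects float input, so normalize the int16 samples to [-1, 1].
waveform_float = tf.cast(waveform, tf.float32) / 32768.0

# Magnitude spectrogram: one FFT per 512-sample window, hopping 256 samples.
spectrogram = tfio.experimental.audio.spectrogram(
    waveform_float, nfft=512, window=512, stride=256)

# Convert to decibels so the color scale matches the description above.
spectrogram_db = 20.0 * tf.math.log(spectrogram + 1e-6) / tf.math.log(10.0)

# Transpose so frequency is on the y-axis and time on the x-axis.
plt.imshow(tf.transpose(spectrogram_db), aspect="auto", origin="lower")
plt.xlabel("time (frames)")
plt.ylabel("frequency (bins)")
plt.show()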

Summary

This is a very brief overview of how we go from a sound to a spectrogram. I hope this helps you understand some of the techniques that are used for audio analysis.

I used multiple resources, but this video is a very important one.

A note on formatting: if you need any of the sample code, just leave a comment. Since this is a crosspost from Twitter, I keep code in images to make it easier to post, but I know it's not optimal when you need to copy it.
