Teaching a Machine to Learn Command Voices
Santiago Reyes Chávez
GCID AIOps Engineer (SRE) at SAP | Master's Student in Applied Artificial Intelligence at Tecnológico de Monterrey | Innovation and Development Engineer | Specialization in Renewable Energy and Advanced Data Analytics
Note: This article is a restructured version of an assignment submitted as part of my studies at Tsinghua University. I would like to thank the institution for providing the data and the opportunity to work on this project.
Introduction
Much of the opportunity Big Data offers lies in the variety of approaches and architectures the industry uses to work with different types of data. Within that variety, this article focuses on images and how to process them to predict categories.
The goal of this task is to classify 24 voice commands, a capability already found in production systems such as Alexa, Siri, and other voice assistants. The approach taken here is to convert the audio into images and then train a Convolutional Neural Network (CNN) to classify them.
A CNN is an architecture designed to process images, composed of layers of neurons that extract features from the images while learning from them. A convolution, as the name suggests, slides a filter over the input (in this case, the pixels of the image), detecting relevant features in local regions. The filters chosen in this article were of size 2x2, allowing them to identify edges, shapes, and textures in the spectrograms produced.
The process is accompanied by pooling layers, activation layers, and dense (fully connected) layers, which reduce dimensionality, introduce non-linearity into the model, and perform the final classification of the spectrogram.
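To make the convolution and pooling operations concrete, here is a minimal NumPy sketch (not the article's code) of a single 2x2 filter pass followed by ReLU activation and 2x2 max pooling over a tiny toy "spectrogram"; the kernel values are illustrative.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (no padding, stride 1) of a small kernel over an image."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Activation: keep positive responses, zero out the rest."""
    return np.maximum(x, 0)

def max_pool(x, size=2):
    """Keep the strongest response in each non-overlapping size x size block."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

# Toy 4x4 "spectrogram" with intensity increasing left to right.
spectrogram = np.arange(16, dtype=float).reshape(4, 4)
edge_kernel = np.array([[-1.0, 1.0], [-1.0, 1.0]])  # responds to vertical edges
features = max_pool(relu(conv2d(spectrogram, edge_kernel)))
print(features)  # [[2.]] -- a single pooled edge response
```

Each stage mirrors the pipeline described above: the filter extracts a local feature map, ReLU discards negative responses, and pooling shrinks the map while keeping the strongest activations.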
Approach to Solve the Problem
The provided data, audio already transformed into spectrograms, consists of a training set and a test set. In both sets, the images were organized into subfolders labeled 0 to 23, one per class.
Exploratory Data Analysis (EDA):
Since the data is unstructured, an exploratory data analysis (EDA) was conducted, segmented into several sections:
Approach to Transform Data:
The original images are 800 pixels wide by 513 pixels tall. Processing images at these dimensions would be very demanding once the network's filters are applied, so I chose to reduce the images to one-third of their original size to speed up processing.
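One simple way to reduce each spectrogram to one-third of its size is block averaging; the sketch below is a hypothetical implementation (not the article's code) that trims the image to a multiple of 3 and averages each 3x3 block.

```python
import numpy as np

def downsample(image, factor=3):
    """Shrink an image by averaging non-overlapping factor x factor blocks."""
    h = image.shape[0] // factor * factor   # trim to a multiple of the factor
    w = image.shape[1] // factor * factor
    trimmed = image[:h, :w]
    return trimmed.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

# Dimensions from the article: 513 pixels tall by 800 pixels wide.
spectrogram = np.random.rand(513, 800)
small = downsample(spectrogram)
print(small.shape)  # (171, 266)
```

Averaging (rather than plain subsampling) keeps some information from every pixel, which matters for spectrograms where thin frequency bands can otherwise be skipped entirely.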
Additionally, three approaches to image transformation were adopted, each changing properties such as saturation, brightness, and contrast.
As an important note, these transformations are applied to the training, validation, and test sets alike, since they are general preprocessing steps, with the expectation that the model can learn better from the transformed images.
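As a sketch of applying one deterministic transformation consistently across all three splits, the following hypothetical NumPy function adjusts contrast and brightness around the image mean; the factor values are illustrative, not the ones used in the assignment.

```python
import numpy as np

def adjust(image, contrast=1.2, brightness=0.1):
    """Deterministic contrast/brightness adjustment, applied identically to every split."""
    out = (image - image.mean()) * contrast + image.mean() + brightness
    return np.clip(out, 0.0, 1.0)   # keep pixel values in a valid range

# Hypothetical stand-ins for the spectrogram splits.
train = [np.random.rand(8, 8) for _ in range(3)]
val = [np.random.rand(8, 8)]
test = [np.random.rand(8, 8)]

# Same function, same parameters, for training, validation, and test alike.
train, val, test = ([adjust(im) for im in split] for split in (train, val, test))
```

Because the transformation is deterministic (no randomness), the model sees the same preprocessing at training and prediction time, which is what distinguishes it from the data augmentation discussed later.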
Approach 1:
In this approach, the images are edited only slightly in search of an improvement.
Approach 2:
In this approach, the goal is to modify the image moderately.
Approach 3:
In this approach, the goal is to modify the image more aggressively.
Data Engineering:
To mitigate the overfitting the model primarily suffered from, data augmentation was applied: randomly selected images were duplicated with random rotations, zoom, translations, and changes in contrast and brightness. The aim was to feed the model a greater variety of images so that it could learn their features without overfitting.
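The augmentation described above can be expressed with Keras preprocessing layers roughly as follows; the factor values and input shape are assumptions for illustration, not the exact settings used in the assignment.

```python
import tensorflow as tf

# Factors below are illustrative; the assignment's exact values are not shown here.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomRotation(0.05),
    tf.keras.layers.RandomZoom(0.1),
    tf.keras.layers.RandomTranslation(0.1, 0.1),
    tf.keras.layers.RandomContrast(0.2),
    tf.keras.layers.RandomBrightness(0.2, value_range=(0.0, 1.0)),
])

batch = tf.random.uniform((4, 171, 266, 1))  # a batch of downscaled spectrograms
augmented = augment(batch, training=True)    # random transforms differ per call
```

These layers are active only when `training=True`, so at inference time the images pass through unchanged; that is what makes it safe to place them as the first layer of a model, as done in the second model below.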
Data Augmentation Proposition 1:
Include simple Keras layers where the following actions are performed:
The effect of these changes on a single sample image can be observed in the following figure.
Data Augmentation Proposition 2:
Include simple Keras layers where the following actions are performed:
The effect of these changes on a single sample image can be observed in the following figure.
First Model (Basic Keras Model)
Data separation:
The training data was split into training and validation sets (with 20% of the training set held out for validation) to run a simple two-stage process: first the model was trained on untransformed data, then with a transformation applied to the training and validation sets, and finally it was applied to the test set.
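A minimal sketch of the 80/20 split, using hypothetical placeholder arrays rather than the real spectrograms:

```python
import numpy as np

# Hypothetical arrays standing in for the spectrogram dataset (100 samples, 24 classes).
images = np.random.rand(100, 171, 266)
labels = np.random.randint(0, 24, size=100)

rng = np.random.default_rng(seed=0)
idx = rng.permutation(len(images))            # shuffle before splitting
split = int(0.8 * len(images))                # 80% train, 20% validation
train_idx, val_idx = idx[:split], idx[split:]

x_train, y_train = images[train_idx], labels[train_idx]
x_val, y_val = images[val_idx], labels[val_idx]
```

Shuffling before the cut matters here because the images live in class-labeled subfolders: without it, a contiguous 20% slice could leave whole classes out of the validation set.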
Results: Simple Model 1
Model number 1 consists of 4 convolutional layers with max pooling, followed by 2 dense layers; all activation functions are ReLU, and the output size is the number of existing classes.
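A Keras sketch consistent with that description might look as follows; the filter counts, dense-layer width, and input shape are assumptions, since the article specifies only the layer types, the 2x2 filters, and the ReLU activations.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Input shape and filter counts are assumptions for illustration.
model = tf.keras.Sequential([
    layers.Input(shape=(171, 266, 1)),
    layers.Conv2D(16, 2, activation="relu"), layers.MaxPooling2D(),
    layers.Conv2D(32, 2, activation="relu"), layers.MaxPooling2D(),
    layers.Conv2D(64, 2, activation="relu"), layers.MaxPooling2D(),
    layers.Conv2D(128, 2, activation="relu"), layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(24),                         # one logit per voice-command class
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
```

With `from_logits=True`, the final layer can stay linear and the loss applies the softmax internally; the third model later makes the softmax explicit instead.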
Different combinations were tested in the first model following the training and validation partition. The goal was to find a model that could achieve around 60% accuracy to deploy it in the test set.
Attempt #1
As observed in both graphs, the model overfits the training data, meaning it struggles to predict data it hasn't seen. Still, 40% accuracy is not bad for 24 categories of voice commands, though there is certainly room for improvement.
First Attempt Combination:
In this model, as in the following ones, the neural network was trained for 20 epochs. The reason for keeping the epoch count low is expanded on later in the document: the computational cost of training the CNN.
Attempt #2
As can be observed, the model has completely overfit, although it reaches an accuracy of around 45%. This could be due either to a weak model or to a lack of training data; it is important to keep in mind that this is a simple model.
Second Attempt Combination:
Second Model (Basic Keras Model)
This model uses the same CNN structure as before; the only difference is that the first data augmentation proposition is added as the first layer of the network.
Similarly, the goal of this section is to find a model that fits and reaches a validation accuracy above 50% before being deployed.
Data Separation:
The same training/validation split as before (20% of the training set held out) was used, following the same two-stage process before applying the model to the test set.
Basically, the same data split is used until a model whose best validation accuracy exceeds 50% is found.
Attempt #1
In this case, the model fails to capture enough relevant information to reach even 10% accuracy in training, and even less in validation.
First Attempt Combination:
From this, transformation number 2 loses points, as it does not provide the information the CNN needs.
Attempt #2
Similarly to the first attempt, the model fails to find enough features to build relationships and make predictions.
Second Attempt Combination:
Finally, with two attempts completed, I conclude the iterations of model number 2 and move on to model number 3, which is given a much more complex structure.
Third Model (CNN Keras Model)
What makes this model special is its complexity: it has 4 convolutional layers with 3x3 filters and a stride of 2 pixels, which shortens training while still mapping the spectrogram precisely.
Each convolutional layer is followed by a batch normalization layer for scaling, and a pooling layer reduces the dimensionality created by the convolutions. As in the previous models, two dense layers are used, now accompanied by normalization and dropout layers to regularize the model, and the last layer outputs the 24 classes.
Another significant difference is the final activation function, softmax, which converts the outputs into a probability distribution over the categories, making predictions easier and more accurate.
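Based on that description, a hypothetical Keras sketch of the third model could look as follows; the filter counts, dropout rate, and input shape are assumptions, while the 3x3 filters, stride of 2, batch normalization, dropout, and 24-way softmax come from the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Filter counts, dropout rate, and input shape are illustrative assumptions.
model3 = tf.keras.Sequential([
    layers.Input(shape=(171, 266, 1)),
    layers.Conv2D(32, 3, strides=2, padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.Conv2D(64, 3, strides=2, padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.Conv2D(128, 3, strides=2, padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.Conv2D(256, 3, strides=2, padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),                       # shrink the final feature map
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.5),                         # regularization against overfitting
    layers.Dense(24, activation="softmax"),      # probability per command class
])
```

The stride of 2 halves the spatial resolution at every convolution, which is what makes this deeper model cheaper to train than stacking stride-1 convolutions of the same depth.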
The data split keeps the same structure, with the goal of finding the first model that reaches roughly 50% validation accuracy. Once found, the model is retrained on all the training data and evaluated on the test set.
Attempt #1
In this first attempt, a drastic change can be observed compared with the first and second models: the loss is significantly reduced and validation accuracy tracks the training curve more closely. However, it still doesn't meet the goal of 50% validation accuracy.
First Attempt Combination:
Different combinations of transformations will be tested to seek better performance.
Attempt #2
In this attempt, the 50% accuracy goal has been met; in fact, the model reached a peak of 61%, exceeding expectations. However, the model overfit, with validation performance deteriorating after epoch 14.
Second Attempt Combination:
Different combinations of transformations will be tested to seek better performance.
Attempt #3
Similarly, the goal has been met, with the added improvement of maintaining around 60% accuracy after epoch 10. This indicates a combination that generalizes better to unseen data, meaning the model consumes the information more usefully.
Third Attempt Combination:
Different combinations of transformations will be tested to seek better performance.
Attempt #4
In the fourth attempt, the number of training epochs was increased to see whether accuracy improved beyond epoch 20. It did not.
Fourth Attempt Combination:
As demonstrated throughout the attempts, the best image transformation has been number 3, as it tracks the training process more closely without stagnating the way transformations 0 and 2 do.
Deploying Third Model (CNN)
In this section, the third model has been selected as the best model. Here, the results on the combined training data will be shown. This means that the training and validation data were combined to form the new training dataset, retaining the structure and transformation learned from the previous training.
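Merging the training and validation splits for the final run is a simple concatenation; the arrays below are hypothetical stand-ins for the real data.

```python
import numpy as np

# Hypothetical arrays standing in for the original 80/20 split.
x_train, y_train = np.random.rand(80, 171, 266), np.random.randint(0, 24, 80)
x_val, y_val = np.random.rand(20, 171, 266), np.random.randint(0, 24, 20)

# Merge the two splits into the final training set for deployment.
x_full = np.concatenate([x_train, x_val], axis=0)
y_full = np.concatenate([y_train, y_val], axis=0)
print(x_full.shape)  # (100, 171, 266)
```

Once the architecture and transformation are fixed, holding back a validation set no longer buys anything, so the final model is trained on every labeled example available.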
Data Transformation Structure:
Training was kept at 25 epochs, accepting the overfitting that could not be corrected across the different iterations, in order to reach a high level of training before evaluating on the test data.
Results #CNN MODEL 3
As expected, the model did overfit; nevertheless, the result was successful: it achieved roughly 70% accuracy on the test set.
GENERAL CONCLUSION
Across the iterations, I was able to construct a convolutional network that achieves 70% accuracy on the test data, albeit at the cost of overfitting the model.
Transformations were a crucial aspect of training the model. While further evidence couldn't be included in this document, the original images contained a lot of noise, something I only realized towards the end of the activity. Transformation number 3 was able to mitigate it, as shown below.
Overall, I believe it's a good model: we were able to control the noise in the images. As a future step, finding a way to reduce the noise in the spectrograms would be crucial. Besides the voice there is external and white noise, so the model may learn from that noise rather than focusing on the voice.
Regarding the proposed network, I suggest adding more layers, although I must warn that training becomes slow, which affected the development of this article.