Teaching a Machine to Learn Command Voices
Santiago Reyes Chávez
GCID AIOps Engineer (SRE) at SAP | Master's Student in Applied Artificial Intelligence at Tecnológico de Monterrey | Innovation and Development Engineer | Specialization in Renewable Energy and Advanced Data Analytics
Note: This article is a restructured version of an assignment submitted as part of my studies at Tsinghua University. I would like to thank the institution for providing the data and the opportunity to work on this project.
Introduction
Much of the opportunity Big Data offers lies in the variety of approaches and architectures the industry uses to work with different types of data. Within that variety, this article focuses on images and how to process them to predict categories.
The goal of this task is to classify 24 voice commands, a capability already found in production systems such as Alexa, Siri, and other voice assistants. The approach taken here is to convert the audio into images and then train a Convolutional Neural Network (CNN) to classify them.
A CNN is an architecture designed to process images, composed of layers of neurons that extract features from the images while learning from them. A convolution, as the name suggests, slides a filter over the input (in this case, the pixels of the image), detecting relevant features in local regions. The filters chosen in this article were of size 2x2, allowing them to identify edges, shapes, and textures in the spectrograms produced.
The process is accompanied by pooling layers, activation layers, and dense (fully connected) layers, which reduce dimensionality, introduce non-linearity into the model, and perform the final classification of the spectrogram.
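To make the convolution and pooling operations concrete, here is a minimal NumPy sketch (not the article's code) of a single 2x2 filter pass followed by ReLU activation and 2x2 max pooling over a tiny toy "spectrogram"; the kernel values are illustrative.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (no padding, stride 1) of a small kernel over an image."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Activation: keep positive responses, zero out the rest."""
    return np.maximum(x, 0)

def max_pool(x, size=2):
    """Keep the strongest response in each non-overlapping size x size block."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

# Toy 4x4 "spectrogram" with intensity increasing left to right.
spectrogram = np.arange(16, dtype=float).reshape(4, 4)
edge_kernel = np.array([[-1.0, 1.0], [-1.0, 1.0]])  # responds to vertical edges
features = max_pool(relu(conv2d(spectrogram, edge_kernel)))
print(features)  # [[2.]] -- a single pooled edge response
```

Each stage mirrors the pipeline described above: the filter extracts a local feature map, ReLU discards negative responses, and pooling shrinks the map while keeping the strongest activations.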
Approach to Solve the Problem
The provided data, audio already transformed into spectrograms, consists of a training set and a test set. In both sets, the images were organized into subfolders labeled 0 to 23, one per class.
Exploratory Data Analysis (EDA):
Since the data is unstructured, an exploratory data analysis (EDA) was conducted, segmented into several sections:
Approach to Transform Data:
The original images are 800 pixels wide by 513 pixels tall. Processing images at these dimensions would be very demanding once the network's filters are applied, so I chose to reduce the images to one-third of their original size to speed up processing.
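One simple way to reduce each spectrogram to one-third of its size is block averaging; the sketch below is a hypothetical implementation (not the article's code) that trims the image to a multiple of 3 and averages each 3x3 block.

```python
import numpy as np

def downsample(image, factor=3):
    """Shrink an image by averaging non-overlapping factor x factor blocks."""
    h = image.shape[0] // factor * factor   # trim to a multiple of the factor
    w = image.shape[1] // factor * factor
    trimmed = image[:h, :w]
    return trimmed.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

# Dimensions from the article: 513 pixels tall by 800 pixels wide.
spectrogram = np.random.rand(513, 800)
small = downsample(spectrogram)
print(small.shape)  # (171, 266)
```

Averaging (rather than plain subsampling) keeps some information from every pixel, which matters for spectrograms where thin frequency bands can otherwise be skipped entirely.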
Additionally, three approaches to image transformation were adopted, each changing properties such as saturation, brightness, and contrast.
As an important note, these transformations are applied to the training, validation, and test sets alike, since they are general preprocessing steps, with the expectation that the model can learn better from the transformed images.
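As a sketch of applying one deterministic transformation consistently across all three splits, the following hypothetical NumPy function adjusts contrast and brightness around the image mean; the factor values are illustrative, not the ones used in the assignment.

```python
import numpy as np

def adjust(image, contrast=1.2, brightness=0.1):
    """Deterministic contrast/brightness adjustment, applied identically to every split."""
    out = (image - image.mean()) * contrast + image.mean() + brightness
    return np.clip(out, 0.0, 1.0)   # keep pixel values in a valid range

# Hypothetical stand-ins for the spectrogram splits.
train = [np.random.rand(8, 8) for _ in range(3)]
val = [np.random.rand(8, 8)]
test = [np.random.rand(8, 8)]

# Same function, same parameters, for training, validation, and test alike.
train, val, test = ([adjust(im) for im in split] for split in (train, val, test))
```

Because the transformation is deterministic (no randomness), the model sees the same preprocessing at training and prediction time, which is what distinguishes it from the data augmentation discussed later.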
Approach 1:
In this approach, the images are edited only slightly in search of an improvement.
Approach 2:
In this approach, the goal is to modify the image moderately.
Approach 3:
In this approach, the goal is to modify the image more aggressively.
Data Engineering:
To mitigate the overfitting the model primarily suffered from, data augmentation was applied: randomly selected images were duplicated with random rotations, zoom, translations, and changes in contrast and brightness. The aim was to feed the model a greater variety of images so that it could learn their features without overfitting.
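The augmentation described above can be expressed with Keras preprocessing layers roughly as follows; the factor values and input shape are assumptions for illustration, not the exact settings used in the assignment.

```python
import tensorflow as tf

# Factors below are illustrative; the assignment's exact values are not shown here.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomRotation(0.05),
    tf.keras.layers.RandomZoom(0.1),
    tf.keras.layers.RandomTranslation(0.1, 0.1),
    tf.keras.layers.RandomContrast(0.2),
    tf.keras.layers.RandomBrightness(0.2, value_range=(0.0, 1.0)),
])

batch = tf.random.uniform((4, 171, 266, 1))  # a batch of downscaled spectrograms
augmented = augment(batch, training=True)    # random transforms differ per call
```

These layers are active only when `training=True`, so at inference time the images pass through unchanged; that is what makes it safe to place them as the first layer of a model, as done in the second model below.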
Data Augmentation Proposition 1:
Include simple Keras layers where the following actions are performed:
The effect of these changes on a single sample image can be observed in the following figure.
Data Augmentation Proposition 2:
Include simple Keras layers where the following actions are performed:
The effect of these changes on a single sample image can be observed in the following figure.
First Model (Basic Keras Model)
Data separation:
The training data was split into training and validation sets (with 20% of the training set held out for validation) to run a simple two-stage process: first the model was trained on untransformed data, then with a transformation applied to the training and validation sets, and finally it was applied to the test set.
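A minimal sketch of the 80/20 split, using hypothetical placeholder arrays rather than the real spectrograms:

```python
import numpy as np

# Hypothetical arrays standing in for the spectrogram dataset (100 samples, 24 classes).
images = np.random.rand(100, 171, 266)
labels = np.random.randint(0, 24, size=100)

rng = np.random.default_rng(seed=0)
idx = rng.permutation(len(images))            # shuffle before splitting
split = int(0.8 * len(images))                # 80% train, 20% validation
train_idx, val_idx = idx[:split], idx[split:]

x_train, y_train = images[train_idx], labels[train_idx]
x_val, y_val = images[val_idx], labels[val_idx]
```

Shuffling before the cut matters here because the images live in class-labeled subfolders: without it, a contiguous 20% slice could leave whole classes out of the validation set.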
Results: Simple Model 1
Model number 1 consists of 4 convolutional layers with max pooling, followed by 2 dense layers; all activation functions are ReLU, and the output size is the number of existing classes.
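A Keras sketch consistent with that description might look as follows; the filter counts, dense-layer width, and input shape are assumptions, since the article specifies only the layer types, the 2x2 filters, and the ReLU activations.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Input shape and filter counts are assumptions for illustration.
model = tf.keras.Sequential([
    layers.Input(shape=(171, 266, 1)),
    layers.Conv2D(16, 2, activation="relu"), layers.MaxPooling2D(),
    layers.Conv2D(32, 2, activation="relu"), layers.MaxPooling2D(),
    layers.Conv2D(64, 2, activation="relu"), layers.MaxPooling2D(),
    layers.Conv2D(128, 2, activation="relu"), layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(24),                         # one logit per voice-command class
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
```

With `from_logits=True`, the final layer can stay linear and the loss applies the softmax internally; the third model later makes the softmax explicit instead.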
Different combinations were tested in the first model following the training and validation partition. The goal was to find a model that could achieve around 60% accuracy to deploy it in the test set.
Attempt #1
As observed in both graphs, the model overfits the training data, meaning it struggles to predict data it hasn't seen. Still, 40% accuracy is not bad for 24 categories of voice commands, though there is certainly room for improvement.
First Attempt Combination:
In this model, as in the following ones, the neural network was trained for 20 epochs. The reason for keeping the epoch count low is expanded on later in the document: the computational cost of training the CNN.
Attempt #2
As can be observed, the model has completely overfit, although it reaches an accuracy of around 45%. This could be due either to a weak model or to a lack of training data; it is important to keep in mind that this is a simple model.
Second Attempt Combination:
Second Model (Basic Keras Model)
This model uses the same CNN structure as before; the only difference is that the first data augmentation proposition is added as the first layer of the network.
Similarly, the goal of this section is to find a model that fits and reaches a validation accuracy above 50% before being deployed.
Data Separation:
The same training/validation split as before (20% of the training set held out) was used, following the same two-stage process before applying the model to the test set.
Basically, the same data split is used until a model whose best validation accuracy exceeds 50% is found.
Attempt #1
In this case, the model fails to capture enough relevant information to reach even 10% accuracy in training, and even less in validation.
First Attempt Combination:
From this, transformation number 2 loses points, as it does not provide the information the CNN needs.
Attempt #2
Similarly to the first attempt, the model fails to find enough features to build relationships and make predictions.
Second Attempt Combination:
Finally, with two attempts completed, I conclude the iterations of model number 2 and move on to model number 3, which is given a much more complex structure.
Third Model (CNN Keras Model)
What makes this model special is its complexity: it has 4 convolutional layers with 3x3 filters and a stride of 2 pixels, which shortens training while still mapping the spectrogram precisely.
Each convolutional layer is followed by a batch normalization layer for scaling, and a pooling layer reduces the dimensionality created by the convolutions. As in the previous models, two dense layers are used, now accompanied by normalization and dropout layers to regularize the model, and the last layer outputs the 24 classes.
Another significant difference is the final activation function, softmax, which converts the outputs into a probability distribution over the categories, making predictions easier and more accurate.
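Based on that description, a hypothetical Keras sketch of the third model could look as follows; the filter counts, dropout rate, and input shape are assumptions, while the 3x3 filters, stride of 2, batch normalization, dropout, and 24-way softmax come from the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Filter counts, dropout rate, and input shape are illustrative assumptions.
model3 = tf.keras.Sequential([
    layers.Input(shape=(171, 266, 1)),
    layers.Conv2D(32, 3, strides=2, padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.Conv2D(64, 3, strides=2, padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.Conv2D(128, 3, strides=2, padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.Conv2D(256, 3, strides=2, padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),                       # shrink the final feature map
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.5),                         # regularization against overfitting
    layers.Dense(24, activation="softmax"),      # probability per command class
])
```

The stride of 2 halves the spatial resolution at every convolution, which is what makes this deeper model cheaper to train than stacking stride-1 convolutions of the same depth.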
The data split keeps the same structure, with the goal of finding the first model that reaches roughly 50% validation accuracy. Once found, the model is retrained on all the training data and evaluated on the test set.
Attempt #1
In this first attempt, a drastic change can be observed compared with the first and second models: the loss is significantly reduced and validation accuracy tracks the training curve more closely. However, it still doesn't meet the goal of 50% validation accuracy.
First Attempt Combination:
Different combinations of transformations will be tested to seek better performance.
Attempt #2
In this attempt, the 50% accuracy goal has been met; in fact, the model reached a peak of 61%, exceeding expectations. However, the model overfit, with validation performance deteriorating after epoch 14.
Second Attempt Combination:
Different combinations of transformations will be tested to seek better performance.
Attempt #3
Similarly, the goal has been met, with the added improvement of maintaining around 60% accuracy after epoch 10. This indicates a combination that generalizes better to unseen data, meaning the model consumes the information more usefully.
Third Attempt Combination:
Different combinations of transformations will be tested to seek better performance.
Attempt #4
In the fourth attempt, the number of training epochs was increased to see whether accuracy improved beyond epoch 20. It did not.
Fourth Attempt Combination:
As demonstrated throughout the attempts, the best image transformation has been number 3, as it tracks the training process more closely without stagnating the way transformations 0 and 2 do.
Deploying Third Model (CNN)
In this section, the third model has been selected as the best model. Here, the results on the combined training data will be shown. This means that the training and validation data were combined to form the new training dataset, retaining the structure and transformation learned from the previous training.
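Merging the training and validation splits for the final run is a simple concatenation; the arrays below are hypothetical stand-ins for the real data.

```python
import numpy as np

# Hypothetical arrays standing in for the original 80/20 split.
x_train, y_train = np.random.rand(80, 171, 266), np.random.randint(0, 24, 80)
x_val, y_val = np.random.rand(20, 171, 266), np.random.randint(0, 24, 20)

# Merge the two splits into the final training set for deployment.
x_full = np.concatenate([x_train, x_val], axis=0)
y_full = np.concatenate([y_train, y_val], axis=0)
print(x_full.shape)  # (100, 171, 266)
```

Once the architecture and transformation are fixed, holding back a validation set no longer buys anything, so the final model is trained on every labeled example available.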
Data Transformation Structure:
Training was kept at 25 epochs, accepting the overfitting that could not be corrected across the different iterations, in order to reach a high level of training before evaluating on the test data.
Results #CNN MODEL 3
As expected, the model did overfit; nevertheless, the result was successful: it achieved roughly 70% accuracy on the test set.
GENERAL CONCLUSION
Across the iterations, I was able to construct a convolutional network that achieves 70% accuracy on the test data, albeit at the cost of overfitting the model.
Transformations were a crucial aspect of training the model. While further evidence couldn't be included in this document, the original images contained a lot of noise, something I only realized towards the end of the activity. Transformation number 3 was able to mitigate it, as shown below.
Overall, I believe it's a good model: we were able to control the noise in the images. As a future step, finding a way to reduce the noise in the spectrograms would be crucial. Besides the voice there is external and white noise, so the model may learn from that noise rather than focusing on the voice.
Regarding the proposed network, I suggest adding more layers, although I must warn that training becomes slow, which affected the development of this article.