Speech recognition with TensorFlow
Recognition is an important human ability. Patterns are properties of objects and of the way we interpret the world around us, and it is possible to recognize both physical and conceptual patterns. This work is about sound patterns, a type of physical pattern that stimulates our senses. It uses automatic speech recognition (ASR), the procedure of decoding human speech into the sequence of letters that carries the message. The speech consists of English words which are commands for the robot ROB-3.
Using the TensorFlow library, which is open source, and following the book "Machine Learning with TensorFlow" by Chris Mattmann, I explain how TensorFlow can be used for speech recognition. This work models human speech with neural networks. A neural network consists of input data, hidden layers and output data, with weights between the cells (memory) and the layers (time).
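As a minimal sketch of those three parts, a network with one input, one hidden layer and one output layer can be defined as below. This is only my illustration, not the code from the tutorial: it is written against the newer Keras API of TensorFlow 2, and the hidden-layer size is chosen arbitrarily; only the 494 and 29 match the dimensions used later in this work.

    import tensorflow as tf

    # Input data -> one hidden layer (with its weight matrix) -> output data.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(494,)),               # input: one feature vector
        tf.keras.layers.Dense(128, activation="relu"),     # hidden layer (weights)
        tf.keras.layers.Dense(29, activation="softmax"),   # output: probability over 29 classes
    ])
    model.summary()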
The input data are WAV files (Waveform Audio File Format) recorded around the year 2000 and sampled at 48 kHz. With TensorFlow the WAV files are first transformed into spectrograms, i.e. from the time domain into the frequency domain using the fast Fourier transform (FFT), and then, through the Mel filter bank and a discrete cosine transform, into Mel-frequency cepstral coefficients (MFCCs). Each frame of the resulting signal contains 26 coefficients, as many as there are letters in the English alphabet. These features live in a space of dimension N*494, where N is the number of frames in the audio file and 494 = 26 coefficients * 19 frames, i.e. each frame carries a context of 9 frames in the past and 9 in the future. For learning, bidirectional LSTM layers (long short-term memory layers of the neural network) are used. The output of the network is, for every frame, a probability distribution over 29 classes: the 26 letters, the space, the apostrophe and the CTC blank symbol (printed as an underscore). Because the letters overlap in time and it is not possible to tell which frame belongs to which letter, the CTC (connectionist temporal classification) loss function is used: it sums the probability over all possible alignments of the letters to the frames, and training looks for the minimum of this loss. For this job I used the open-source code from https://github.com/mrubash1/RNN-Tutorial/; a compressed sketch of the whole pipeline is given after this paragraph.
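The sketch below is my own illustration of the pipeline, written against TensorFlow 2 and the python_speech_features package, so it only approximates what the TensorFlow 1 code in the RNN-Tutorial repository does; the LSTM size (128) and the FFT length are my assumptions, the 26/9/29 numbers are the ones described above.

    import numpy as np
    import scipy.io.wavfile as wav
    from python_speech_features import mfcc
    import tensorflow as tf

    NUM_FEATURES = 26        # MFCC coefficients per frame (as many as letters)
    CONTEXT = 9              # frames of past and of future context
    NUM_CLASSES = 29         # 26 letters + space + apostrophe + CTC blank

    def wav_to_features(path):
        """WAV file -> matrix of shape (N, 494): 26 MFCCs with 9 frames of context on each side."""
        rate, signal = wav.read(path)
        feats = mfcc(signal, samplerate=rate, numcep=NUM_FEATURES, nfft=2048)  # (N, 26); long FFT for 48 kHz
        padded = np.pad(feats, ((CONTEXT, CONTEXT), (0, 0)), mode="constant")
        windows = [padded[i:i + 2 * CONTEXT + 1].flatten()                     # 19 * 26 = 494
                   for i in range(feats.shape[0])]
        return np.array(windows, dtype=np.float32)

    # Bidirectional LSTM acoustic model: for every frame it produces logits over the
    # 29 classes; a softmax over these logits gives the probability distribution.
    inputs = tf.keras.Input(shape=(None, NUM_FEATURES * (2 * CONTEXT + 1)))
    hidden = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(128, return_sequences=True))(inputs)
    logits = tf.keras.layers.Dense(NUM_CLASSES)(hidden)
    model = tf.keras.Model(inputs, logits)

    # The CTC loss sums the probability over all possible alignments of the letter
    # sequence to the frames; training minimizes this loss.
    def ctc_loss(labels, logits, label_length, logit_length):
        return tf.nn.ctc_loss(labels=labels, logits=logits,
                              label_length=label_length, logit_length=logit_length,
                              logits_time_major=False, blank_index=NUM_CLASSES - 1)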
With that code, on MX Linux with TensorFlow installed, I was able to train a model of human speech and then to recognize speech with it. I had the option of using the CPU or the GPU; since I have an AMD CPU and an AMD GPU, and TensorFlow is currently adapted for Nvidia GPUs but not for AMD GPUs, I decided to use the CPU (a small sketch of how to force this is given after this paragraph). As the result of the job, the recognized letters are printed as messages in the console. After 500 training steps the training converges to a model that can be used for recognition. All the code is written in Python, unlike my bachelor thesis from 20 years ago, which was written in C++ and used MATLAB.
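To keep TensorFlow on the CPU and stop it from trying to initialize a (non-Nvidia) GPU, it is enough to hide the CUDA devices before TensorFlow starts. This is my own small sketch, not part of the tutorial, and it uses the TensorFlow 2 API only to verify the result:

    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"    # hide all CUDA devices -> CPU only

    import tensorflow as tf
    print(tf.config.list_physical_devices())     # should list only the CPU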