NLP for Generative AI, Part 2: Neural Network Design
Darko Medin
Data Scientist and a Biostatistician. Developer of ML/AI models. Researcher in the fields of Biology and Clinical Research. Helping companies with Digital products, Artificial intelligence, Machine Learning.
In the previous part we saw how to preprocess and tokenize text data for language processing. This time you will learn how to use TensorFlow to design the neural network that processes that text. So let's start...
The first step is, as always, loading the required libraries. (Make sure you have installed scikit-learn and TensorFlow by running pip install scikit-learn and pip install tensorflow in the command prompt.)
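A minimal version of those imports could look like this (this is a sketch assuming pandas, scikit-learn and TensorFlow are the only libraries used throughout the tutorial):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical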
The second step is to load the Language Detection dataset (Language detection.csv), which can be found here: https://www.kaggle.com/datasets/basilb2s/language-detection.
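Loading it with pandas might look like this (the file name follows the dataset above; the 'Text' and 'Language' column names are my assumption about the CSV layout):

# Load the Kaggle Language Detection dataset (assumed to be saved locally
# as 'Language detection.csv' with 'Text' and 'Language' columns).
data = pd.read_csv('Language detection.csv')
texts = data['Text'].values
labels = data['Language'].values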
The first and second preprocessing steps are the same as in the previous tutorial: cleaning the text data and tokenizing it. If you want to go back and learn about tokenization of the text data, you may use this link: https://www.dhirubhai.net/pulse/developing-llms-generative-ai-tokenization-darko-medin. But in this case, since we are performing both training and testing, we also need to separate the data into training and testing partitions. We can use train_test_split() from sklearn, as sketched below.
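A sketch of those steps, continuing from the snippets above, could be as follows (the cleaning regex, the 80/20 split and the random seed are illustrative choices, not necessarily the ones used in the original code):

import re

def clean_text(text):
    # Remove punctuation and digits while keeping letters of any script,
    # then lowercase; an illustrative cleaning step only.
    return re.sub(r'[^\w\s]|\d', ' ', str(text)).lower()

texts_clean = [clean_text(t) for t in texts]

# Fit the tokenizer on the cleaned text and turn it into integer sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts_clean)
sequences = tokenizer.texts_to_sequences(texts_clean)

# Encode the language names as integer class ids
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels)

# Split into training and testing partitions
X_train, X_test, y_train, y_test = train_test_split(
    sequences, y, test_size=0.2, random_state=42)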
In this case I added another step, padding the data: the pad_sequences() function makes sure that all the sequences have the same length. That takes care of the input data. The labels are also converted to categorical values, and the data is then ready for training.
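Continuing the sketch, padding and label conversion might look like this (max_len = 100 and 'post' padding are assumptions made for illustration):

# Pad every sequence to the same length and one-hot encode the labels
max_len = 100
X_train_pad = pad_sequences(X_train, maxlen=max_len, padding='post')
X_test_pad = pad_sequences(X_test, maxlen=max_len, padding='post')

num_classes = len(label_encoder.classes_)   # 17 languages in this dataset
y_train_cat = to_categorical(y_train, num_classes=num_classes)
y_test_cat = to_categorical(y_test, num_classes=num_classes)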
Now for designing the artificial neural network. Instead of using a standard feedforward ANN, we will use an architecture that is good at processing sequential data. There are many layers for this purpose, such as RNNs (recurrent neural networks), LSTMs (Long Short-Term Memory networks), GRUs (Gated Recurrent Units) and others. In this tutorial, let's design a simple LSTM neural network.
As you can see, the Input layer feeds data into the Embedding layer; the data is then processed by two LSTM layers with 256 neurons each and passed to another 256-neuron Dense() layer before finally being sent to the output layer. Keep in mind that the output layer needs to have the same number of neurons as num_classes, since this layer's outputs are used for classification. In this case num_classes is 17.
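A sketch of that architecture in Keras could look like this (the embedding dimension of 128 is an assumption; the rest follows the description above):

# Embedding -> two LSTM layers with 256 units -> Dense(256) -> softmax output
vocab_size = len(tokenizer.word_index) + 1

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(max_len,)),
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=128),
    tf.keras.layers.LSTM(256, return_sequences=True),  # first LSTM layer
    tf.keras.layers.LSTM(256),                          # second LSTM layer
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(num_classes, activation='softmax')  # 17 classes here
])
model.summary()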
Let's compile the model and start the training process!
(Note: I am using with tf.device('/GPU') to run the training on my GPU. Training NLP models on a lot of text data can take some time, which is why GPUs are useful for speeding up the process. On weaker GPUs this may take hours; on better ones, minutes.)
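A sketch of the compile and training step, following the GPU note above (the optimizer, loss, batch size and validation split are illustrative choices; 130 epochs matches the run described below):

# Compile and train the model on the GPU if one is available
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

with tf.device('/GPU:0'):
    history = model.fit(X_train_pad, y_train_cat,
                        validation_split=0.1,
                        epochs=130,
                        batch_size=64)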
After 130 iterations the model achieved 99.75% train set accuracy and 95.12% validation set accuracy, which is quite a good result for a language detection model.
I will stop the training at 130 epochs, as this is for educational purposes, and print out the test accuracy using model.evaluate() as shown in the code above. The test set accuracy is 94.92%, which is very close to the validation accuracy; that is also a good indicator in this case...
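Evaluating on the held-out test partition, as described, might look like this:

# Evaluate the trained model on the test partition
test_loss, test_acc = model.evaluate(X_test_pad, y_test_cat, verbose=0)
print(f'Test accuracy: {test_acc:.4f}')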
Ideally I would want the validation accuracy to go over 98 or 99%, and in the next tutorial we will see how to use additional feature extraction to improve the accuracy to such levels.
Thanks for reading!
by Darko Medin
Here is the code on my GitHub: https://github.com/DarkoMedin/Designing-the-LM-Artificial-Neural-Network, and here is the link to the previous tutorial in this series: https://www.dhirubhai.net/pulse/developing-llms-generative-ai-tokenization-darko-medin