NLP for Generative AI part 2
Neural Network design


In the previous part, you saw how to preprocess and tokenize text data for language processing. This time you will learn how to use TensorFlow to design the neural network for processing the text. So let's start...

The first step is, as always, loading the required libraries. (Make sure that you have installed scikit-learn and TensorFlow by running pip install scikit-learn and pip install tensorflow in the command prompt.)

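The original code screenshot is not reproduced here; a minimal sketch of the imports that cover the steps in this tutorial might look like this (the exact imports used originally may differ):

```python
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
```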

The second step is to load the Language Detection dataset (Language detection.csv), which may be found here: https://www.kaggle.com/datasets/basilb2s/language-detection.


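A minimal sketch of this step, assuming the CSV has been downloaded from Kaggle into the working directory:

```python
# Load the Language Detection dataset downloaded from Kaggle.
data = pd.read_csv("Language detection.csv")
print(data.head())
```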

The next two steps are the same as in the previous tutorial: cleaning the text data and tokenizing it. If you want to go back to the previous tutorial and learn about tokenization of text data, you may use this link: https://www.dhirubhai.net/pulse/developing-llms-generative-ai-tokenization-darko-medin. But in this case, since we are performing both training and testing, we also need to separate the data into training and testing partitions. We can use train_test_split() from sklearn, as sketched below.

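A sketch of the split, assuming the dataset's Text and Language column names (cleaning follows the previous tutorial and is not repeated here):

```python
texts = data["Text"]        # assumed column name
labels = data["Language"]   # assumed column name

# Tokenize the (cleaned) text into integer sequences.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# Hold out part of the data for testing (the split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    sequences, labels, test_size=0.2, random_state=42)
```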

In this case I added another step, padding the data: the pad_sequences() function makes sure that all sequences are the same length. That's for the input data. The labels are also converted to categorical values, and the data is then ready for training.

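A sketch of the padding and label-encoding step; the maximum sequence length here is an assumed value:

```python
max_len = 100  # assumed maximum sequence length
X_train_pad = pad_sequences(X_train, maxlen=max_len)
X_test_pad = pad_sequences(X_test, maxlen=max_len)

# Map the 17 language names to integers, then to one-hot vectors.
encoder = LabelEncoder()
y_train_cat = to_categorical(encoder.fit_transform(y_train))
y_test_cat = to_categorical(encoder.transform(y_test))
```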

Now let's design the artificial neural network. Instead of using a standard feedforward ANN, we will use an architecture that is good at processing sequential data. There are many layer types for this purpose, such as RNNs (recurrent neural networks), LSTM (long short-term memory) networks, GRUs (gated recurrent units), and others. In this tutorial, let's design a simple LSTM artificial neural network.

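A sketch matching the architecture described below (Input, Embedding, two 256-neuron LSTM layers, a 256-neuron Dense layer, and a softmax output); the embedding dimension is an assumption:

```python
num_classes = 17                              # languages in the dataset
vocab_size = len(tokenizer.word_index) + 1    # tokenizer vocabulary size

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(max_len,)),
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=128),
    tf.keras.layers.LSTM(256, return_sequences=True),  # pass sequences on
    tf.keras.layers.LSTM(256),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.summary()
```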

As you can see, the Input layer feeds data into the Embedding layer; the data is then processed by two LSTM layers with 256 neurons each and sent to another 256-neuron Dense() layer before finally reaching the output layer. Keep in mind that the output layer needs to have the same number of neurons as num_classes, as this layer's outputs will be used for classification. In this case num_classes is 17.



Let's compile the model and start the training process!

(Note: I am using with tf.device('/GPU') to run the training on my GPU. Training NLP models with a lot of text data may take some time, and that's why GPUs are useful for speeding up the process. On weaker GPUs this may take hours; on better ones, minutes.)

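A sketch of the compile and fit calls; the optimizer, batch size, and validation split are assumptions, and '/GPU:0' is the usual device string for the first GPU:

```python
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Train on the GPU; 130 epochs matches the run described below.
with tf.device('/GPU:0'):
    history = model.fit(X_train_pad, y_train_cat,
                        validation_split=0.1,
                        epochs=130,
                        batch_size=64)
```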

After 130 iterations the model achieved 99.75% train set accuracy and 95.12% validation set accuracy. This is quite a good result for a language detection model.

I will stop the training at 130 iterations, as this is for education purposes, and print out the test accuracy using model.evaluate(), as sketched below. The test set accuracy is 94.92%, which is very similar to the validation accuracy and is also a good indicator in this case...

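A minimal evaluation sketch on the held-out test partition:

```python
test_loss, test_acc = model.evaluate(X_test_pad, y_test_cat)
print(f"Test accuracy: {test_acc * 100:.2f}%")
```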

Ideally, I would want the validation accuracy to go over 98 or 99%, and in the next tutorial we will see how to use additional feature extraction to improve the accuracy to such levels.

Thanks for reading!


by Darko Medin

