Developing LLMs for Generative AI: Tokenization and Vectorization
Darko Medin
Data Scientist and a Biostatistician. Developer of ML/AI models. Researcher in the fields of Biology and Clinical Research. Helping companies with Digital products, Artificial intelligence, Machine Learning.
Large language generative AI models are developed mostly by working with large amounts of text data, so anyone working in this area should have specific skills in text processing. In this tutorial we will discuss NLP (natural language processing), specifically two techniques called Tokenization and Vectorization.
Before starting, make sure to load the Python libraries and the dataset so we can also show practically how Tokenization and Vectorization work.
Before loading the libraries, make sure you have installed TensorFlow, which can be installed with: pip install tensorflow. You can run this from the command prompt.
Also make sure to download the text dataset, called Language Detection, which can be found here: https://www.kaggle.com/datasets/basilb2s/language-detection.
These two cells of code will load the required libraries: pandas, numpy, tensorflow and string. As you can see, the Tokenizer function is also imported.
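The original code cells appear as screenshots in the article; a minimal sketch of the imports they describe (pandas, numpy, string and the TensorFlow Tokenizer) would be:

import string

import numpy as np
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer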
To enable AI models to learn from text data effectively, we must first preprocess the text into a format that machines can understand. Tokenization is one of the most important steps in this procedure. So what is tokenization? It is the separation of sentences, or even whole text documents, into words or characters.
This way a machine can understand what a word is in a structural sense, and potentially learn from specific words in the text and from the associations between words.
Since most large language models today are based on Transformer and deep learning architectures, they work best with numbers. To enable them to learn from text, we also need to convert the tokens into numbers, so that each word is represented by a single number instead of a sequence of letters.
Before starting the tokenization/vectorization procedure, we must also make sure to clean the data of unwanted (in this case) punctuation signs.
Now let's see how to perform the data cleaning, Tokenization and Vectorization procedures in Python on the dataset we downloaded previously, the Language Detection dataset.
First observe the columns of the dataset.
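Assuming the Kaggle file is saved locally as 'Language Detection.csv' (the exact filename may differ depending on your download), loading and inspecting the dataset could look like this:

# Load the Language Detection dataset (adjust the path/filename if yours differs)
data = pd.read_csv('Language Detection.csv')

# Inspect the columns and the first few rows
print(data.columns)
print(data.head())
print(data['Language'].nunique())   # number of distinct languages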
We can see that the dataset has two columns, 'Text' and 'Language', and that each row holds a specific sentence in a specific language.
We can see that the first rows are in English, but there are actually 17 languages in this dataset. We can also see that there are a lot of punctuation signs that need to be cleaned.
Using string.punctuation and text.replace() within a function I called clean(), most of the data is now in a more optimized form and is also converted to lower case.
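The clean() function itself is shown as a screenshot in the original article; a sketch of how it might be written, using string.punctuation and replace() as described, is:

def clean(text):
    # Remove every punctuation character and convert the text to lower case
    for ch in string.punctuation:
        text = text.replace(ch, '')
    return text.lower()

# Apply the cleaning function to the 'Text' column
data['Text'] = data['Text'].apply(clean)
print(data['Text'].head())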
Now the text column is ready for Tokenization and Vectorization. Let's use the Tokenizer() function we imported at the beginning of the tutorial.
What is interesting is that the TensorFlow Tokenizer() is not only a tokenizer but also a vectorizer. This means that when you fit the tokenizer and convert the texts to sequences, it performs both tokenization, separating the text into words, and vectorization, automatically assigning numbers to those words.
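A sketch of this word-level step, assuming the cleaned 'Text' column from above (the exact arguments used in the original cells are not shown):

# Word-level tokenizer: builds a vocabulary and maps each word to an integer index
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data['Text'])

# Convert each sentence into its sequence of integer tokens
sequences = tokenizer.texts_to_sequences(data['Text'])
print(sequences[0])   # the first sentence as a list of word indices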
You can now see that the first sentence is tokenized and also vectorized, so each word is represented by a separate, unique number. For example, the word 'nature' is represented by the number 82, the word 'sense' by 5884, the word 'the' by 3, and so on. The same vectorization principle is applied to all text in the data, so the number 82 now means 'nature' in machine terms across the whole dataset.
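The exact indices depend on the word frequencies in your copy of the data, so the numbers above are illustrative; you can check the mapping on your own run through the tokenizer's word_index dictionary:

# word_index maps each word to its integer id (lower ids correspond to more frequent words)
print(tokenizer.word_index['the'])
print(tokenizer.word_index['nature'])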
The tokenization principle we applied was based on words, so every word is a separate token.
Further practice:
Another way of tokenizing text data is at the character level, so that each character becomes a token. Below is a sketch of how character-level tokenization can be implemented in Python.
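This sketch again assumes the cleaned 'Text' column from earlier (note that char_level expects a boolean):

# Character-level tokenizer: every single character becomes a token
char_tokenizer = Tokenizer(char_level=True)
char_tokenizer.fit_on_texts(data['Text'])

char_sequences = char_tokenizer.texts_to_sequences(data['Text'])
print(char_sequences[0])   # the first sentence as a list of character indices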
You can see that I added char_level=True inside the Tokenizer() function, and the tokenizer now performed the processing at the character level. Performing tokenization at the character level has its advantages and disadvantages when working with AI. One advantage is that the data resolution is higher and there are more tokens in the data, but it may be more difficult for the AI to learn the meaning of specific words and pay attention to them.
Thank you for reading and practicing with LLM Development for AI Tutorial I - Tokenization and Vectorization. In the next tutorial we will train artificial neural networks using deep learning to learn how to detect specific languages.
By Darko Medin - AI developer, Data Science Mentor and Consultant
You may follow updates for the next parts of the series in the same newsletter where this article is published, 'Advanced Stats/Data Science', on LinkedIn.