Developing LLMs for Generative AI: Tokenization and Vectorization
Darko Medin
Data Scientist and a Biostatistician. Developer of ML/AI models. Researcher in the fields of Biology and Clinical Research. Helping companies with Digital products, Artificial intelligence, Machine Learning.
Large language generative AI models are developed mostly by working with large amounts of text data, so anyone working in this area should have specific skills in text processing. In this tutorial we will discuss NLP (natural language processing), specifically two techniques called Tokenization and Vectorization.
Before starting, make sure to load the Python libraries and the dataset so we can also show practically how Tokenization and Vectorization work.
Before loading the libraries, make sure you have installed TensorFlow, which can be installed with: pip install tensorflow. You can run this from the command prompt.
Also make sure to download the text dataset, called Language Detection, which can be found here: https://www.kaggle.com/datasets/basilb2s/language-detection.
These two cells of code will load the required libraries: pandas, numpy, tensorflow and string. As you can see, the Tokenizer function is also imported.
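The original code cells appear as screenshots in the article; a minimal sketch of the imports they describe (pandas, numpy, string and the TensorFlow Tokenizer) would be:

import string

import numpy as np
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer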
To enable AI models to learn from text data effectively, we must first preprocess the text into a format that machines can understand. Tokenization is one of the most important steps in this procedure. So what is tokenization? It is the separation of sentences, or even whole text documents, into words or characters.
This way a machine can understand what a word is in a structural sense, and potentially learn from specific words in the text and from the associations between words.
Since most large language models today are based on Transformer and deep learning architectures, they work best with numbers. To enable them to learn from text, we also need to convert the tokens into numbers, so that each word is represented by a single number instead of a sequence of letters.
Before starting the tokenization/vectorization procedure, we must also make sure to clean the data of unwanted (in this case) punctuation signs.
Now let's see how to perform the data cleaning, Tokenization and Vectorization procedures in Python on the dataset we downloaded previously, the Language Detection dataset.
First observe the columns of the dataset.
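Assuming the Kaggle file is saved locally as 'Language Detection.csv' (the exact filename may differ depending on your download), loading and inspecting the dataset could look like this:

# Load the Language Detection dataset (adjust the path/filename if yours differs)
data = pd.read_csv('Language Detection.csv')

# Inspect the columns and the first few rows
print(data.columns)
print(data.head())
print(data['Language'].nunique())   # number of distinct languages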
We can see that the dataset has two columns, 'Text' and 'Language', and that each row holds a specific sentence in a specific language.
We can see that the first rows are in English, but there are actually 17 languages in this dataset. We can also see that there are a lot of punctuation signs that need to be cleaned.
Using string.punctuation and text.replace() within a function I called clean(), most of the data is now in a more optimized form and is also converted to lower case.
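The clean() function itself is shown as a screenshot in the original article; a sketch of how it might be written, using string.punctuation and replace() as described, is:

def clean(text):
    # Remove every punctuation character and convert the text to lower case
    for ch in string.punctuation:
        text = text.replace(ch, '')
    return text.lower()

# Apply the cleaning function to the 'Text' column
data['Text'] = data['Text'].apply(clean)
print(data['Text'].head())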
Now the text column is ready for Tokenization and Vectorization. Let's use the Tokenizer() function we imported at the beginning of the tutorial.
What is interesting is that the TensorFlow Tokenizer() is not only a tokenizer but also a vectorizer. This means that when you fit the tokenizer and convert the texts to sequences, it performs both tokenization, separating the text into words, and vectorization, automatically assigning numbers to those words.
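A sketch of this word-level step, assuming the cleaned 'Text' column from above (the exact arguments used in the original cells are not shown):

# Word-level tokenizer: builds a vocabulary and maps each word to an integer index
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data['Text'])

# Convert each sentence into its sequence of integer tokens
sequences = tokenizer.texts_to_sequences(data['Text'])
print(sequences[0])   # the first sentence as a list of word indices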
You can now see that the first sentence is tokenized and also vectorized, so each word is represented by a separate, unique number. For example, the word 'nature' is represented by the number 82, the word 'sense' by 5884, the word 'the' by 3, and so on. The same vectorization principle is applied to all text in the data, so the number 82 now means 'nature' in machine terms across the whole dataset.
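The exact indices depend on the word frequencies in your copy of the data, so the numbers above are illustrative; you can check the mapping on your own run through the tokenizer's word_index dictionary:

# word_index maps each word to its integer id (lower ids correspond to more frequent words)
print(tokenizer.word_index['the'])
print(tokenizer.word_index['nature'])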
The tokenization principle we applied was based on words, so every word is a separate token.
Further practice:
Another way of tokenizing text data is at the character level, so that each character becomes a token. Below is a sketch of how character-level tokenization can be implemented in Python.
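This sketch again assumes the cleaned 'Text' column from earlier (note that char_level expects a boolean):

# Character-level tokenizer: every single character becomes a token
char_tokenizer = Tokenizer(char_level=True)
char_tokenizer.fit_on_texts(data['Text'])

char_sequences = char_tokenizer.texts_to_sequences(data['Text'])
print(char_sequences[0])   # the first sentence as a list of character indices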
You can see that I added char_level=True inside the Tokenizer() function, and the tokenizer now performed the processing at the character level. Performing tokenization at the character level has its advantages and disadvantages when working with AI. One advantage is that the data resolution is higher and there are more tokens in the data, but it may be more difficult for the AI to learn the meaning of specific words and pay attention to them.
Thank you for reading and practicing with LLM Development for AI Tutorial I - Tokenization and Vectorization. In the next tutorial we will train artificial neural networks using deep learning to learn how to detect specific languages.
By Darko Medin - AI developer, Data Science Mentor and Consultant
You may follow updates for the next parts of the series in the same newsletter where this article is published, 'Advanced Stats/Data Science', on LinkedIn.