Generative AI in Tourism. How to train an LLM (part 2) -  Data preprocessing for LLMs

Data Preprocessing for Large Language Models

In this, my third article documenting my deep-learning journey into AI and large language models (LLMs), I will move on to the second step of training LLMs. In my previous articles I highlighted the significance of, and the opportunity for, every industry, tourism included, that decides to take on the challenge of developing its own LLM.

As we can already witness from all the development going on, these trained AI models can be used for a variety of tasks, such as generating text, translating languages, and writing different kinds of creative content… for now; what we’ll see in a couple of months or years is unimaginable. This may sound scary to some, but really exciting to others!

Image source: vitalflux.com

So let’s jump into the fundamentals of this second step.

For LLMs to perform well, they need to be trained on high-quality data. This means that the data needs to be clean, consistent, and representative of the tasks that the LLM will be used for.

For those new to the subject (like me): data preprocessing is the process of cleaning and formatting raw data so that it can be used for machine learning. This includes tasks such as removing noise, correcting errors, and transforming data into a format that the machine learning algorithm can understand.

What does data preprocessing mean?

Let’s imagine you have a messy pile of ingredients for a recipe. Some ingredients may be spoiled, others might be too big or too small, and some may need to be chopped or peeled. Similarly, raw data can have issues like missing values, outliers, or inconsistent formatting. During data preprocessing, you take care of these problems. You remove missing data points or replace them with suitable values. Outliers, which are extreme values that can skew the analysis, are also dealt with. Additionally, you may scale or normalize the data so that all features have a similar range.

Image source: V7Labs


Formatting the data is another crucial step. It involves making sure that the data is in a consistent and appropriate format for the machine learning algorithm you'll use. For example, if you have a date column, you may convert it into a numerical representation or extract useful features like day, month, and year separately. By performing these cleaning and formatting tasks, data preprocessing ensures that the data is reliable, consistent, and ready for the machine learning algorithms to learn from. It helps improve the accuracy and effectiveness of the models built on that data.
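To make the date example above concrete, here is a minimal Python sketch (the booking dates are made up for illustration) that converts an ISO date string into separate day, month, and year features:

```python
from datetime import datetime

# Hypothetical booking dates in ISO format (YYYY-MM-DD)
bookings = ["2023-07-14", "2023-12-01"]

def date_features(date_str):
    """Convert an ISO date string into separate day/month/year features."""
    d = datetime.strptime(date_str, "%Y-%m-%d")
    return {"day": d.day, "month": d.month, "year": d.year}

features = [date_features(b) for b in bookings]
print(features[0])  # {'day': 14, 'month': 7, 'year': 2023}
```

A model can then learn seasonal patterns (e.g., summer vs. winter bookings) from the month feature alone, which a raw date string would hide.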

In the context of LLMs, data preprocessing is essential for ensuring that the model learns from the data correctly. If the data is not clean, the model may learn to make incorrect predictions. There are a number of different data preprocessing techniques that can be used for LLMs. In the context of tourism, these tasks can look like the following:

  • Removing stop words (common words that do not add much meaning, such as "the", "a", and "of");
  • Stemming or lemmatizing words (reducing words to their root form);
  • Categorizing text (e.g., classifying reviews as positive, negative, or neutral);
  • Normalizing text (e.g., converting all text to lowercase).

Data preprocessing is important for training an LLM no matter the use it will have or the industry/objective behind it, because it ensures that the data is clean and consistent. This helps to improve the accuracy and performance of the LLM.


Data preprocessing. Image source: serokell.io


How do we preprocess data? Tips, recommendations, and examples

1. Clean the data: This includes removing any errors, inconsistencies, or duplicate data. You may also want to normalize the data, which means converting it to a common format.

Tips:

  • Use a regular expression to find and remove common errors, such as typos and grammatical errors.
  • Use a normalization tool to convert the data to a common format, such as Unicode.
  • Use a duplicate detection tool to find and remove duplicate data.

Recommendations:

  • Use a combination of tools and techniques to clean your data.
  • Be careful not to remove too much data, as this could impact the accuracy of your LLM.

Examples:

  • A common error in tourism data is the misspelling of city names. You could use a regular expression to find and replace all misspellings with the correct spelling.
  • Another common error in tourism data is the use of different formats for dates. You could use a normalization tool to convert all dates to a common format, such as ISO 8601.
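The two examples above can be sketched in a few lines of Python. The city name, the misspelling, and the record format below are hypothetical; the idea is simply a regex substitution for known misspellings plus a regex-driven conversion of US-style dates to ISO 8601:

```python
import re
from datetime import datetime

# Hypothetical raw tourism records: a misspelled city name and mixed date formats
records = [
    "Visited Barcellona on 05/14/2023",
    "Visited Barcelona on 2023-06-02",
]

# Map known misspellings to the correct spelling
corrections = {r"\bBarcellona\b": "Barcelona"}

def clean_record(text):
    # Replace each known misspelling with its correction
    for pattern, fix in corrections.items():
        text = re.sub(pattern, fix, text)

    # Normalize US-style dates (MM/DD/YYYY) to ISO 8601 (YYYY-MM-DD)
    def to_iso(match):
        return datetime.strptime(match.group(0), "%m/%d/%Y").strftime("%Y-%m-%d")

    return re.sub(r"\b\d{2}/\d{2}/\d{4}\b", to_iso, text)

print([clean_record(r) for r in records])
```

In a real pipeline the corrections table would be built from a gazetteer of place names rather than written by hand.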


2. Tokenize the data: This means breaking the data down into individual words or phrases. This will make it easier for the LLM to understand the data.

Tips:

  • Use a tokenization tool to break the data down into individual words or phrases.
  • Use a stemming or lemmatization tool to reduce words to their root forms.

Recommendations:

  • Use a tokenization tool that is compatible with the LLM that you are using.
  • Use a stemming or lemmatization tool that is appropriate for the language of your data.

Examples:

  • The sentence "I went to the beach" could be tokenized into the following words: "I", "went", "to", "the", "beach".
  • The word "playing" could be stemmed to "play" or lemmatized to "play".
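A minimal word tokenizer along those lines might look like this in Python. Real LLM pipelines use trained subword tokenizers, so treat this purely as an illustration:

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (a toy whitespace/punctuation tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

print(tokenize("I went to the beach"))  # ['i', 'went', 'to', 'the', 'beach']
```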


Wait a minute! Did I just say stemming or lemmatization tool?? What is that??

Stemming and lemmatization are techniques used in natural language processing (NLP) to reduce words to their base or root form. The goal is to simplify the words and group together variations of the same word, so that they can be treated as a single entity during text analysis or machine learning tasks.

  • Stemming involves removing the suffixes or prefixes from words to obtain the core root form. For example, the words "running" and "runs" would be reduced to their common base form "run." (A stemmer cannot handle irregular forms like "ran"; that is where lemmatization helps.)
  • Lemmatization, on the other hand, goes a step further by considering the word's meaning and part of speech (noun, verb, adjective, etc.) to determine the base form, called the lemma. For instance, the word "better" would be lemmatized to "good", because "good" is its lemma. Lemmatization typically involves referencing linguistic databases and applying more complex algorithms compared to stemming.

Both stemming and lemmatization are used to reduce the vocabulary size, handle word variations, and improve the efficiency and accuracy of text analysis, search engines, and machine learning models that process textual data.

Ok... now I understand, let's continue!


3. Remove stop words: Stop words are common words that do not add much meaning to the data. Removing these words can help to reduce the size of the data and improve the performance of the LLM.

Tips:

  • Use a list of stop words to remove common words that do not add much meaning to the data.
  • You can also use a stop word removal tool to automate this process.

Recommendations:

  • Use a list of stop words that is appropriate for the language of your data.
  • Be careful not to remove too many stop words, as this could impact the accuracy of your LLM.

Examples:

  • The word "the" is a common stop word in English. You could remove this word from the data to reduce its size.
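As a quick sketch (the stop word list below is a tiny illustrative subset, not a real one; production pipelines use larger, language-specific lists):

```python
# A small illustrative stop word list for English
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ["i", "went", "to", "the", "beach"]
print(remove_stop_words(tokens))  # ['i', 'went', 'beach']
```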


4. Stem or lemmatize the data: As I explained earlier, stemming and lemmatization are processes that reduce words to their root forms. This can help to improve the accuracy of the LLM by reducing the number of different words that it needs to learn.

Tips:

  • Use a stemming or lemmatization tool to reduce words to their root forms.
  • This can help to improve the accuracy of the LLM by reducing the number of different words that it needs to learn.

Recommendations:

  • Use a stemming or lemmatization tool that is appropriate for the language of your data.
  • Be careful not to over-stem or over-lemmatize the data, as this could impact the accuracy of your LLM.

Examples:

  • The words "playing", "played", and "plays" could all be stemmed to the root word "play".
  • The word "walking" could be lemmatized to "walk".
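Here is a toy suffix-stripping stemmer in Python that reproduces the "play" example above. It is only a sketch; real stemmers such as the Porter stemmer apply many more rules:

```python
# Suffixes checked longest-first; a real stemmer has far more rules than this
SUFFIXES = ["ing", "ed", "s"]

def naive_stem(word):
    """Strip one common suffix to approximate the root form (toy stemmer)."""
    for suffix in SUFFIXES:
        # Keep at least a 3-letter stem so "is" doesn't become "i"
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["playing", "played", "plays"]])  # ['play', 'play', 'play']
```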


5. Create a vocabulary: A vocabulary is a list of all the words that will be used by the LLM. This will help the LLM to learn the relationships between words and improve its ability to generate text.

Tips:

  • Create a list of all the words that will be used by the LLM.
  • This will help the LLM to learn the relationships between words and improve its ability to generate text.

Recommendations:

  • Use a tool to create a vocabulary from your data.
  • Be sure to include all of the words that are important for your LLM, such as the names of places, activities, and people.

Examples:

  • The word "beach" could be included in the vocabulary for a tourism LLM.
  • The word "tour" could also be included, as it is related to the concept of tourism.
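Building a frequency-based vocabulary can be sketched with Python's collections.Counter; the little corpus below is invented for illustration:

```python
from collections import Counter

# Hypothetical preprocessed tourism sentences (already tokenized and lowercased)
corpus = [
    ["great", "beach", "tour"],
    ["the", "beach", "was", "beautiful"],
    ["booked", "a", "city", "tour"],
]

# Count word frequencies across the corpus, then assign each word an integer id,
# most frequent words first
counts = Counter(token for sentence in corpus for token in sentence)
vocab = {word: idx for idx, (word, _) in enumerate(counts.most_common())}

print("beach" in vocab and "tour" in vocab)  # True
```

In practice you would also cap the vocabulary size and reserve ids for special tokens (unknown word, padding, etc.).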

I’m sure you noticed a couple of tools mentioned in the tips above that can turn out to be helpful or even do the job. If you want to dig deeper, or if you are actually in the process, here’s a little help:

Tools for data preprocessing

Data preprocessing will allow us to:

  • Improved accuracy: by removing noise and errors from the data, preprocessing helps to improve the accuracy of the machine learning algorithm.

  • Increased performance: it reduces the amount of data that needs to be processed, which speeds up training.

  • Improved interpretability: clean, consistent data makes it easier to understand how the algorithm is making decisions.

Well, what do you know… I’ve just taken you through the second step of training our LLM, with only five more to go to master it!


In my next article I’ll move forward to step 3: how to choose a machine learning framework. Choosing the right machine learning framework for training an LLM is important because it makes the development process easier, improves performance, and provides helpful tools and libraries. A good framework should be user-friendly, fast, and efficient. It should offer pre-built model architectures and components specific to language models. Scalability is also important for handling large datasets or complex models. In short, picking the right framework helps you develop LLMs more easily and effectively.

Thanks for joining me in this journey of how to train an LLM! See you on step 3!

#AI #LLM #GPT3 #GPT4 #MachineLearning #TourismAI #HospitalityAI #IA #IAenturismo #consultorturismo

Deepthi Sankepalli

Data Science Architect @ValueLabs, Generative AI PG @ IISc, Langchain, Expert SQL, AWS, Snowflake, Conversational AI, Chatbot, RAG, LLM AI Agents, Agentic Applications

7 months ago

But the LLM BERT uses BPE (Byte-Pair Encoding) to shrink its root words; for example, a word like "playing" will be decoded to play + ##ing. Stemming/lemmatization is not needed for these models.


More articles by Franklin Carpenter
