Generative AI in Tourism. How to train an LLM (part 2) - Data preprocessing for LLMs
Franklin Carpenter
I help destinations and hotels design and implement their AI strategy
Data Preprocessing for Large Language Models
In this, my third article documenting my deep dive into AI and large language models (LLMs), I will advance to the second step of training LLMs. I’ve stated in my previous articles the significance and the opportunity that exist for all industries, tourism included, that decide to take on the challenge of developing their own LLM.
As we can already witness from all the development going on, these trained AI models can be used for a variety of tasks, such as generating text, translating languages, and writing different kinds of creative content… for now; what we’ll see in a couple of months or years is unimaginable. This may sound scary to some, but really exciting to others!
So let’s jump into the fundamentals of this second step.
For LLMs to perform well, they need to be trained on high-quality data. This means that the data needs to be clean, consistent, and representative of the tasks that the LLM will be used for.
For those new to the subject (like me), data preprocessing is the process of cleaning and formatting raw data so that it can be used for machine learning. This includes tasks such as removing noise, correcting errors, and transforming data into a format that the machine learning algorithm can understand.
What does data preprocessing mean?
Let’s imagine you have a messy pile of ingredients for a recipe. Some ingredients may be spoiled, others might be too big or too small, and some may need to be chopped or peeled. Similarly, raw data can have issues like missing values, outliers, or inconsistent formatting. During data preprocessing, you take care of these problems. You remove any missing data points or replace them with suitable values. Outliers, which are extreme values that can skew the analysis, are also dealt with. Additionally, you may scale or normalize the data so that all features have a similar range.
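To make those three steps concrete, here is a minimal sketch using only the Python standard library. The review scores, the 1–5 valid range, and the median-fill strategy are illustrative assumptions, not a prescription:

```python
import statistics

# Hypothetical hotel review scores on a 1-5 scale; None marks a missing
# value and 97.0 is a data-entry error (an outlier).
raw = [4.5, None, 3.8, 4.1, 97.0, 4.4, None, 3.9]

# 1. Remove outliers: anything outside the valid 1-5 range is dropped.
in_range = [x for x in raw if x is None or 1 <= x <= 5]

# 2. Fill missing values with the median of the observed scores.
median = statistics.median(x for x in in_range if x is not None)
filled = [x if x is not None else median for x in in_range]

# 3. Min-max scale so every value lands in [0, 1].
lo, hi = min(filled), max(filled)
scaled = [(x - lo) / (hi - lo) for x in filled]
```

Note the order: the outlier is dropped before filling, so the bad value never leaks into the fill statistic.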
Formatting the data is another crucial step. It involves making sure that the data is in a consistent and appropriate format for the machine learning algorithm you'll use. For example, if you have a date column, you may convert it into a numerical representation or extract useful features like day, month, and year separately. By performing these cleaning and formatting tasks, data preprocessing ensures that the data is reliable, consistent, and ready for the machine learning algorithms to learn from. It helps improve the accuracy and effectiveness of the models built on that data.
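As a quick illustration of that date example, here is how raw date strings (hypothetical check-in dates) can be turned into separate numeric features with the standard library:

```python
from datetime import datetime

check_ins = ["2024-07-15", "2024-12-03"]  # hypothetical booking dates

features = []
for s in check_ins:
    d = datetime.strptime(s, "%Y-%m-%d")
    features.append({
        "day": d.day,
        "month": d.month,
        "year": d.year,
        "weekday": d.weekday(),  # 0 = Monday ... 6 = Sunday
    })
```

Each string becomes a small dictionary of numeric features that an algorithm can actually learn from.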
In the context of LLMs, data preprocessing is essential for ensuring that the model learns from the data correctly. If the data is not clean, the model may learn to make incorrect predictions. There are a number of different data preprocessing techniques that can be used for LLMs. In the context of tourism, these tasks can look a bit like the following:
Removing stop words (common words that do not add much meaning, such as "the", "a", and "of");
Stemming or lemmatizing words (reducing words to their root form);
Categorizing text (e.g., classifying reviews as positive, negative, or neutral);
Normalizing text (e.g., converting all text to lowercase).
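Of these tasks, categorizing text is the only one I won’t come back to below, so here is a toy sketch: a keyword-count classifier with made-up word lists. A real project would use a trained sentiment model; this only illustrates the idea:

```python
# Tiny hand-picked word lists -- purely illustrative assumptions.
POSITIVE = {"great", "lovely", "excellent", "clean"}
NEGATIVE = {"dirty", "rude", "noisy", "terrible"}

def categorize(review: str) -> str:
    """Label a review positive/negative/neutral by counting keywords."""
    words = set(review.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, `categorize("Great location and clean rooms")` counts two positive keywords and no negative ones, so the review is labeled positive.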
Data preprocessing is important for training an LLM no matter what use it will have or the industry/objective behind it, because it ensures that the data is clean and consistent. This helps to improve the accuracy and performance of the LLM.
How do we preprocess data? Tips, recommendations and examples
1. Clean the data: This includes removing any errors, inconsistencies, or duplicate data. You may also want to normalize the data, which means converting it to a common format.
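As a sketch of what cleaning can look like for text, here is a standard-library snippet that normalizes casing and whitespace and then drops duplicate reviews; the sample reviews are invented:

```python
import re

reviews = [
    "  Great stay at the Hotel!!  ",
    "great stay at the hotel!!",      # duplicate once normalized
    "Awful, the WiFi never worked",
]

def normalize(text: str) -> str:
    text = text.lower().strip()
    return re.sub(r"\s+", " ", text)  # collapse runs of whitespace

seen, deduped = set(), []
for r in reviews:
    n = normalize(r)
    if n not in seen:                 # keep only the first copy
        seen.add(n)
        deduped.append(n)
```

After normalization the first two reviews are identical, so only two reviews survive deduplication.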
2. Tokenize the data: This means breaking the data down into individual words or phrases. This will make it easier for the LLM to understand the data.
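A minimal word-level tokenizer can be written with a single regular expression. (Production LLMs typically use subword tokenizers such as BPE or WordPiece instead, but the idea is the same):

```python
import re

sentence = "The beach resort was amazing, we'll definitely return!"

# Split into lowercase word tokens: runs of letters, digits, or apostrophes.
tokens = re.findall(r"[a-z0-9']+", sentence.lower())
```

Punctuation disappears and contractions like "we'll" stay together as one token.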
Wait a minute! Did I just say stemming or lemmatization tool? What is that?
Stemming and lemmatization are techniques used in natural language processing (NLP) to reduce words to their base or root form. The goal is to simplify the words and group together variations of the same word, so that they can be treated as a single entity during text analysis or machine learning tasks.
Both stemming and lemmatization are used to reduce the vocabulary size, handle word variations, and improve the efficiency and accuracy of text analysis, search engines, and machine learning models that process textual data.
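To show the difference, here is a deliberately crude sketch: a toy suffix-stripping stemmer and a toy dictionary lemmatizer. Real projects would reach for NLTK’s PorterStemmer and WordNetLemmatizer (or spaCy); the suffix list and lemma table below are illustrative assumptions:

```python
def toy_stem(word: str) -> str:
    """Crude suffix stripping, in the spirit of a Porter-style stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# A lemmatizer maps words to their dictionary form instead; toy lookup table:
LEMMAS = {"better": "good", "was": "be", "went": "go"}

def toy_lemma(word: str) -> str:
    return LEMMAS.get(word, toy_stem(word))
```

Notice the difference: the stemmer leaves "better" untouched because no suffix matches, while the lemmatizer knows its dictionary form is "good".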
Ok... now I understand, let’s continue!
3. Remove stop words: Stop words are common words that do not add much meaning to the data. Removing these words can help to reduce the size of the data and improve the performance of the LLM.
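A sketch with a tiny hand-made stop-word list (libraries such as NLTK and spaCy ship full lists for many languages):

```python
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}  # tiny sample

tokens = ["the", "pool", "area", "of", "the", "resort", "is", "beautiful"]
content_tokens = [t for t in tokens if t not in STOP_WORDS]
```

Half the tokens vanish, and the ones that carry the meaning remain.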
4. Stem or lemmatize the data: As I explained earlier, stemming and lemmatization are processes that reduce words to their root forms. This can help to improve the accuracy of the LLM by reducing the number of different words that it needs to learn.
5. Create a vocabulary: A vocabulary is a list of all the words that will be used by the LLM. This will help the LLM to learn the relationships between words and improve its ability to generate text.
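A minimal sketch of building a word-level vocabulary from a toy corpus: count every token, keep the frequent ones, and reserve an id for unknown words. (Real LLM vocabularies are subword-based and far larger; the corpus and the frequency cutoff here are illustrative assumptions.)

```python
from collections import Counter

corpus = [
    "great hotel great location",
    "great breakfast poor service",
    "poor location",
]

# Count every whitespace token across the corpus.
counts = Counter(tok for doc in corpus for tok in doc.split())

# Keep words seen at least twice; id 0 is reserved for unknown words.
vocab = {"<unk>": 0}
for word, freq in counts.most_common():
    if freq >= 2:
        vocab[word] = len(vocab)

def encode(text: str) -> list[int]:
    """Map each token to its vocabulary id, falling back to <unk>."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.split()]
```

Any word outside the vocabulary, like "spa" here, maps to the unknown id rather than breaking the pipeline.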
I’m sure you noticed a couple of tools mentioned above that can turn out to be helpful or even do the job for you. If you want to dig deeper, or if you are already in the middle of the process, they are a good place to start.
Data preprocessing will allow us to:
· Improved accuracy: by removing noise and errors from the data, preprocessing helps the machine learning algorithm make more accurate predictions.
· Increased performance: it reduces the amount of data that needs to be processed, which speeds up training.
· Improved interpretability: clean, well-structured data makes it easier to understand how the algorithm is making its decisions.
Well, what do you know… I’ve just taken you through the second step of training our LLM, with only 5 more to go to master it!
In my next article I’ll move forward to step 3: how to choose a machine learning framework. Choosing the right framework for training an LLM is important because it makes the development process easier, improves performance, and provides helpful tools and libraries. A good framework should be user-friendly, fast, and efficient; it should offer pre-built model architectures and components specific to language models; and it should scale to large datasets and complex models.
Thanks for joining me in this journey of how to train an LLM! See you on step 3!
Comment (7 months ago): But the LLM BERT uses BPE (Byte-Pair Encoding) to shrink words to their roots; for example, words like "play" and "playing" will be encoded as "play" and "play" + "##ing". Stemming/lemmatization is not needed for these models.