Generative AI in Tourism. How to train an LLM (part 2) - Data preprocessing for LLMs
Franklin Carpenter
I help destinations and hotels design and implement their AI strategy
Data Preprocessing for Large Language Models
In this, my third article documenting my deep dive into AI and large language models (LLMs), I will advance to the second step of training LLMs. I’ve stated in my previous articles the significance and the opportunity that exist for all industries, tourism included, that decide to take on the challenge of developing their own LLM.
As we can already witness from all the development going on, these trained AI models can be used for a variety of tasks, such as generating text, translating languages, and writing different kinds of creative content… for now; what we’ll see in a couple of months or years is unimaginable. This may sound scary to some, but really exciting to others!
So let’s jump into the fundamentals of this second step.
For LLMs to perform well, they need to be trained on high-quality data. This means that the data needs to be clean, consistent, and representative of the tasks that the LLM will be used for.
For those new to the subject (like me), data preprocessing is the process of cleaning and formatting raw data so that it can be used for machine learning. This includes tasks such as removing noise, correcting errors, and transforming data into a format that the machine learning algorithm can understand.
What does data preprocessing mean?
Let’s imagine you have a messy pile of ingredients for a recipe. Some ingredients may be spoiled, others might be too big or too small, and some may need to be chopped or peeled. Similarly, raw data can have issues like missing values, outliers, or inconsistent formatting. During data preprocessing, you take care of these problems. You remove any missing data points or replace them with suitable values. Outliers, which are extreme values that can skew the analysis, are also dealt with. Additionally, you may scale or normalize the data so that all features have a similar range.
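To make those three steps concrete, here is a minimal sketch using only the Python standard library. The review scores, the 1–5 valid range, and the median-fill strategy are illustrative assumptions, not a prescription:

```python
import statistics

# Hypothetical hotel review scores on a 1-5 scale; None marks a missing
# value and 97.0 is a data-entry error (an outlier).
raw = [4.5, None, 3.8, 4.1, 97.0, 4.4, None, 3.9]

# 1. Remove outliers: anything outside the valid 1-5 range is dropped.
in_range = [x for x in raw if x is None or 1 <= x <= 5]

# 2. Fill missing values with the median of the observed scores.
median = statistics.median(x for x in in_range if x is not None)
filled = [x if x is not None else median for x in in_range]

# 3. Min-max scale so every value lands in [0, 1].
lo, hi = min(filled), max(filled)
scaled = [(x - lo) / (hi - lo) for x in filled]
```

Note the order: the outlier is dropped before filling, so the bad value never leaks into the fill statistic.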
Formatting the data is another crucial step. It involves making sure that the data is in a consistent and appropriate format for the machine learning algorithm you'll use. For example, if you have a date column, you may convert it into a numerical representation or extract useful features like day, month, and year separately. By performing these cleaning and formatting tasks, data preprocessing ensures that the data is reliable, consistent, and ready for the machine learning algorithms to learn from. It helps improve the accuracy and effectiveness of the models built on that data.
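As a quick illustration of that date example, here is how raw date strings (hypothetical check-in dates) can be turned into separate numeric features with the standard library:

```python
from datetime import datetime

check_ins = ["2024-07-15", "2024-12-03"]  # hypothetical booking dates

features = []
for s in check_ins:
    d = datetime.strptime(s, "%Y-%m-%d")
    features.append({
        "day": d.day,
        "month": d.month,
        "year": d.year,
        "weekday": d.weekday(),  # 0 = Monday ... 6 = Sunday
    })
```

Each string becomes a small dictionary of numeric features that an algorithm can actually learn from.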
In the context of LLMs, data preprocessing is essential for ensuring that the model learns from the data correctly. If the data is not clean, the model may learn to make incorrect predictions. There are a number of different data preprocessing techniques that can be used for LLMs. In the context of tourism, these tasks can look a bit like the following:
Removing stop words (common words that do not add much meaning, such as "the", "a", and "of");
Stemming or lemmatizing words (reducing words to their root form);
Categorizing text (e.g., classifying reviews as positive, negative, or neutral);
Normalizing text (e.g., converting all text to lowercase).
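Of these tasks, categorizing text is the only one I won’t come back to below, so here is a toy sketch: a keyword-count classifier with made-up word lists. A real project would use a trained sentiment model; this only illustrates the idea:

```python
# Tiny hand-picked word lists -- purely illustrative assumptions.
POSITIVE = {"great", "lovely", "excellent", "clean"}
NEGATIVE = {"dirty", "rude", "noisy", "terrible"}

def categorize(review: str) -> str:
    """Label a review positive/negative/neutral by counting keywords."""
    words = set(review.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, `categorize("Great location and clean rooms")` counts two positive keywords and no negative ones, so the review is labeled positive.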
Data preprocessing is important for training an LLM no matter what use it will have or the industry/objective behind it, because it ensures that the data is clean and consistent. This helps to improve the accuracy and performance of the LLM.
How do we preprocess data? Tips, recommendations and examples
1. Clean the data: This includes removing any errors, inconsistencies, or duplicate data. You may also want to normalize the data, which means converting it to a common format.
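As a sketch of what cleaning can look like for text, here is a standard-library snippet that normalizes casing and whitespace and then drops duplicate reviews; the sample reviews are invented:

```python
import re

reviews = [
    "  Great stay at the Hotel!!  ",
    "great stay at the hotel!!",      # duplicate once normalized
    "Awful, the WiFi never worked",
]

def normalize(text: str) -> str:
    text = text.lower().strip()
    return re.sub(r"\s+", " ", text)  # collapse runs of whitespace

seen, deduped = set(), []
for r in reviews:
    n = normalize(r)
    if n not in seen:                 # keep only the first copy
        seen.add(n)
        deduped.append(n)
```

After normalization the first two reviews are identical, so only two reviews survive deduplication.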
2. Tokenize the data: This means breaking the data down into individual words or phrases. This will make it easier for the LLM to understand the data.
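A minimal word-level tokenizer can be written with a single regular expression. (Production LLMs typically use subword tokenizers such as BPE or WordPiece instead, but the idea is the same):

```python
import re

sentence = "The beach resort was amazing, we'll definitely return!"

# Split into lowercase word tokens: runs of letters, digits, or apostrophes.
tokens = re.findall(r"[a-z0-9']+", sentence.lower())
```

Punctuation disappears and contractions like "we'll" stay together as one token.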
Wait a minute! Did I just say stemming or lemmatization tool? What is that?
Stemming and lemmatization are techniques used in natural language processing (NLP) to reduce words to their base or root form. The goal is to simplify the words and group together variations of the same word, so that they can be treated as a single entity during text analysis or machine learning tasks.
Both stemming and lemmatization are used to reduce the vocabulary size, handle word variations, and improve the efficiency and accuracy of text analysis, search engines, and machine learning models that process textual data.
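To show the difference, here is a deliberately crude sketch: a toy suffix-stripping stemmer and a toy dictionary lemmatizer. Real projects would reach for NLTK’s PorterStemmer and WordNetLemmatizer (or spaCy); the suffix list and lemma table below are illustrative assumptions:

```python
def toy_stem(word: str) -> str:
    """Crude suffix stripping, in the spirit of a Porter-style stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# A lemmatizer maps words to their dictionary form instead; toy lookup table:
LEMMAS = {"better": "good", "was": "be", "went": "go"}

def toy_lemma(word: str) -> str:
    return LEMMAS.get(word, toy_stem(word))
```

Notice the difference: the stemmer leaves "better" untouched because no suffix matches, while the lemmatizer knows its dictionary form is "good".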
Ok... now I understand, let’s continue!
3. Remove stop words: Stop words are common words that do not add much meaning to the data. Removing these words can help to reduce the size of the data and improve the performance of the LLM.
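A sketch with a tiny hand-made stop-word list (libraries such as NLTK and spaCy ship full lists for many languages):

```python
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}  # tiny sample

tokens = ["the", "pool", "area", "of", "the", "resort", "is", "beautiful"]
content_tokens = [t for t in tokens if t not in STOP_WORDS]
```

Half the tokens vanish, and the ones that carry the meaning remain.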
4. Stem or lemmatize the data: As I explained earlier, stemming and lemmatization are processes that reduce words to their root forms. This can help to improve the accuracy of the LLM by reducing the number of different words that it needs to learn.
5. Create a vocabulary: A vocabulary is a list of all the words that will be used by the LLM. This will help the LLM to learn the relationships between words and improve its ability to generate text.
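A minimal sketch of building a word-level vocabulary from a toy corpus: count every token, keep the frequent ones, and reserve an id for unknown words. (Real LLM vocabularies are subword-based and far larger; the corpus and the frequency cutoff here are illustrative assumptions.)

```python
from collections import Counter

corpus = [
    "great hotel great location",
    "great breakfast poor service",
    "poor location",
]

# Count every whitespace token across the corpus.
counts = Counter(tok for doc in corpus for tok in doc.split())

# Keep words seen at least twice; id 0 is reserved for unknown words.
vocab = {"<unk>": 0}
for word, freq in counts.most_common():
    if freq >= 2:
        vocab[word] = len(vocab)

def encode(text: str) -> list[int]:
    """Map each token to its vocabulary id, falling back to <unk>."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.split()]
```

Any word outside the vocabulary, like "spa" here, maps to the unknown id rather than breaking the pipeline.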
I’m sure you noticed a couple of tools mentioned above that can turn out to be helpful or even do the job for you. If you want to dig deeper, or if you are already in the middle of the process, they are a good place to start.
Data preprocessing will allow us to:
· Improved accuracy: by removing noise and errors from the data, preprocessing helps the machine learning algorithm make more accurate predictions.
· Increased performance: it reduces the amount of data that needs to be processed, which speeds up training.
· Improved interpretability: clean, well-structured data makes it easier to understand how the algorithm is making its decisions.
Well, what do you know… I’ve just taken you through the second step of training our LLM, with only 5 more to go to master it!
In my next article I’ll move forward to step 3: how to choose a machine learning framework. Choosing the right framework for training an LLM is important because it makes the development process easier, improves performance, and provides helpful tools and libraries. A good framework should be user-friendly, fast, and efficient; it should offer pre-built model architectures and components specific to language models; and it should scale to large datasets and complex models.
Thanks for joining me in this journey of how to train an LLM! See you on step 3!
Comment (7 months ago): But the LLM BERT uses BPE (Byte-Pair Encoding) to shrink words to their roots; for example, words like "play" and "playing" will be encoded as "play" and "play" + "##ing". Stemming/lemmatization is not needed for these models.