Real-World AI: Why LLMs Might Be Lost for Words

Introduction

When companies use Large Language Models (LLMs), they hope these systems can understand and use every word from the information they were taught. But here's the issue: LLMs can't remember every word. They have a word list limit.

What does this mean? Well, imagine you're talking about special topics like Bitcoin, making music, or rock climbing. Words like "hodl," "reverb," or "crux" are important in these areas. But if these words are not in the LLM's word list, the system might just skip over them or get confused.

So, businesses need to pick the right words for their LLMs carefully. This way, when the LLM talks about these topics, it doesn't miss the important words and makes sense to the people it's communicating with.

Vocabulary vs. Dictionary Size in LLMs

Vocabulary

  1. The "vocabulary" of an LLM refers to the set of words that the model has been trained to understand and use. This includes words from various sources and domains that the model uses to interpret input and generate responses.
  2. The vocabulary's breadth ensures that the LLM can handle a wide range of topics and understand diverse linguistic nuances, slang, idioms, and jargon.
  3. However, the vocabulary doesn't specify the relationships between words, their definitions, or their contextual usage rules. It's more about word recognition.

Dictionary

  1. The "dictionary " describes the subset of words that the model not only recognizes but also deeply understands in terms of definitions, usage, and inter-relationships with other words. It's akin to the model's internal reference guide.
  2. The dictionary helps the model disambiguate words, understand context, and generate more coherent, contextually appropriate responses.
  3. The dictionary size is integral to tasks requiring understanding of not just what words are, but what they mean, like translation or text summarization.

Training Datasets: Vocabulary and Dictionary Sizes

The training datasets for prominent Large Language Models (LLMs) are colossal, often encompassing vast corpora of text from across the internet. These datasets need to be extensive to capture the diversity of human language. However, the actual vocabulary size (the number of unique words the model is trained on) can vary based on several factors, including the language, the breadth of the training data, and the tokenization method used.

Raw Data: Initially, LLMs like GPT-3 were trained on datasets extracted from a broad range of sources, including books, websites, and other texts, amounting to hundreds of gigabytes or even terabytes of raw text data. The number of unique words can run into the millions, especially with multilingual data. The next step helps manage this huge number of words.
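
As a rough, simplified sketch of this first step, the snippet below counts unique whitespace-separated words in a raw text file. The file name and the naive splitting are assumptions made for illustration; real pipelines also normalize case, punctuation, and encoding.

    from collections import Counter

    # Simplified sketch: count unique whitespace-separated "words" in a raw corpus.
    # "corpus.txt" is a hypothetical file; real pipelines normalize the text first.
    word_counts = Counter()
    with open("corpus.txt", encoding="utf-8") as f:
        for line in f:
            word_counts.update(line.lower().split())

    print(f"Unique raw words: {len(word_counts):,}")
    print("Most common:", word_counts.most_common(5))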

Tokenization: The raw text is then tokenized (split into pieces, often words or parts of words). The resulting token set depends heavily on the selected tokenization technique.
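
To see how tokenization treats niche terms, here is a small illustration assuming the open-source tiktoken package (and its cl100k_base encoding) is available; any BPE or WordPiece tokenizer would make the same point.

    # Illustration only: how a subword tokenizer breaks niche words into pieces.
    # Assumes the open-source `tiktoken` package is installed.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    for word in ["hodl", "reverb", "crux"]:
        token_ids = enc.encode(word)
        pieces = [enc.decode([t]) for t in token_ids]
        print(f"{word!r} -> {pieces}")
    # Words that are not single entries in the vocabulary come back as
    # several smaller pieces rather than one token.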

Dictionary Construction: In one common approach, unique tokens are identified and a frequency count is conducted. Rare tokens (those appearing fewer than a certain number of times across the entire dataset) may be excluded from the final vocabulary. Other approaches can also build a dictionary that fits the business task without bloating the resources required to handle it, and further refinement steps typically follow.
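
As a concrete sketch of the frequency-count approach, the hypothetical helper below keeps only tokens that occur at least a minimum number of times; the threshold and the toy token stream are assumptions for illustration.

    from collections import Counter

    def build_dictionary(token_stream, min_count=2):
        """Keep only tokens that appear at least `min_count` times.

        A simplified sketch of frequency-based dictionary construction;
        the threshold and input format are illustrative assumptions.
        """
        counts = Counter(token_stream)
        return {tok for tok, c in counts.items() if c >= min_count}

    # Hypothetical usage with an already-tokenized corpus:
    tokens = ["the", "model", "the", "hodl", "the", "model", "reverb"]
    print(build_dictionary(tokens))  # rare tokens like 'hodl' and 'reverb' are dropped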

Dictionary sizes vary from model to model, illustrating the differing strategies and goals of each LLM. A larger vocabulary allows for more nuanced understanding and text generation, but it also requires more computational power to manage effectively. Each model represents a different point of balance between these factors, tailored to the specific use cases it was designed to address.

Vocabulary Reconstruction:

Most of the frequent words in the training dataset's vocabulary can be reconstructed from the dictionary, but not all of them: some parts of infrequent words may be missing from the dictionary. That can lead to limited training on, and limited 'understanding' of, unusual words.

Practicality:

Vocabulary/Dictionary Balancing Act

In practice, maintaining a balance between an expansive vocabulary and a comprehensive dictionary is crucial for the efficacy of LLMs. Too much emphasis on vocabulary breadth without deepening the dictionary can lead to shallow performance, while an overly specialized dictionary might limit the model's versatility.

When we talk about the limitations related to the vocabulary of LLMs, these often directly influence the dictionary size. For instance:

  • Updating Challenges: As languages evolve, adding new words to the model's vocabulary requires updating the dictionary as well, ensuring the LLM recognizes these words and understands their meanings and use cases. This ongoing updating is resource-intensive.
  • Resource Constraints: A vast dictionary demands significant storage capacity and computational power to sift through during operations, impacting performance and cost.
  • Depth of Understanding: An extensive vocabulary doesn't always equate to a deep understanding. The model might recognize many words without truly understanding their specific definitions or contextual implications, especially if the dictionary isn't as comprehensive as the vocabulary.
  • Specialization vs. Generalization: LLMs trained on specialized datasets might have a rich dictionary in specific areas but lack breadth in others. Conversely, models trained on too general a dataset might have broad vocabulary coverage but lack depth in any specific field.

Dictionary Management

Collecting and maintaining dictionaries for LLMs from a dataset's entire vocabulary is crucial to ensure a deep, contextual understanding of words. By combining multiple approaches and continuously refining the dictionary based on real-world feedback and performance, LLM developers can ensure a robust and contextually rich understanding of language, enhancing the model's overall effectiveness.

Here are several approaches to curating and refining dictionaries for LLMs:

Frequency-Based Selection:

Identify words and phrases based on their occurrence frequency in the dataset. Commonly used words are essential for the dictionary, while rare words might be considered less critical. This approach ensures that the LLM recognizes and understands words that users are most likely to use.
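
One way to make this concrete: instead of a fixed count threshold, keep the most frequent tokens until they cover a chosen share of all occurrences in the corpus. The sketch below assumes a hypothetical 95% coverage target and toy counts.

    from collections import Counter

    def select_by_coverage(token_counts, coverage=0.95):
        """Keep the most frequent tokens until they account for `coverage`
        of all token occurrences. A sketch; the 95% target is an assumption."""
        total = sum(token_counts.values())
        selected, running = [], 0
        for token, count in token_counts.most_common():
            selected.append(token)
            running += count
            if running / total >= coverage:
                break
        return selected

    # Hypothetical usage:
    counts = Counter({"the": 500, "model": 300, "token": 150, "hodl": 3, "crux": 2})
    print(select_by_coverage(counts))  # rare niche terms fall below the cutoff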

Semantic Clustering:

Group words based on their semantic similarities using techniques like word embeddings. This helps in capturing not just individual words but also their interrelationships, ensuring that words with similar meanings or contexts are included.
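
A minimal sketch of the idea, assuming scikit-learn is available: cluster word vectors with k-means and review each cluster as a unit. The embeddings below are random placeholders; in practice they would come from a trained embedding model or the LLM's own embedding layer.

    import numpy as np
    from sklearn.cluster import KMeans

    # Placeholder embeddings: random 50-dimensional vectors stand in for real
    # word embeddings (word2vec, fastText, or the LLM's own embedding layer).
    words = ["loan", "mortgage", "interest", "guitar", "reverb", "chord"]
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(len(words), 50))

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
    for word, label in zip(words, kmeans.labels_):
        print(f"{word}: cluster {label}")
    # With real embeddings, finance terms and music terms would tend to land
    # in separate clusters, each of which can then be reviewed as a unit.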

Domain-Specific Curation:

If the LLM is intended for a specific domain (e.g., medicine, finance), prioritize terms and phrases relevant to that field. Extracting terms from domain-specific glossaries, textbooks, or databases can be beneficial.
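
A minimal sketch of this step, assuming a hypothetical glossary file with one term per line: load the domain terms and merge them into the working dictionary.

    # "finance_glossary.txt" is a hypothetical file with one domain term per line.
    def load_glossary(path):
        with open(path, encoding="utf-8") as f:
            return {line.strip().lower() for line in f if line.strip()}

    base_dictionary = {"model", "token", "loan"}          # hypothetical base set
    domain_terms = load_glossary("finance_glossary.txt")  # e.g. "amortization", "collateral"
    curated_dictionary = base_dictionary | domain_terms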

Hierarchical Sampling:

Divide the vocabulary into hierarchical categories or topics. Sample words from each category to ensure a balanced representation of all topics in the dictionary.
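
As a sketch of hierarchical sampling, the snippet below draws a fixed quota of words from each topic bucket so no single domain dominates; the categories and the quota are illustrative assumptions.

    import random

    # Illustrative topic buckets; in practice these come from topic modelling
    # or an existing taxonomy of the training data.
    categories = {
        "finance":  ["loan", "mortgage", "hodl", "equity", "dividend"],
        "music":    ["reverb", "chord", "tempo", "timbre", "staccato"],
        "climbing": ["crux", "belay", "crimp", "beta", "piton"],
    }

    random.seed(0)
    quota = 3  # assumed per-topic quota
    sampled = {topic: random.sample(words, quota) for topic, words in categories.items()}
    balanced_dictionary = {w for words in sampled.values() for w in words}
    print(balanced_dictionary)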

Expert Review:

Engage domain experts to review and refine the dictionary, adding or removing terms based on real-world relevance and importance. This ensures that the dictionary reflects genuine expertise and covers terms that practitioners in the field deem essential.

Iterative Refinement:

As the LLM is used, continuously refine the dictionary based on performance feedback. Words or phrases that lead to misunderstandings or inaccuracies can be re-evaluated, and the dictionary can be updated accordingly.

Inclusion of Synonyms and Variants:

For every primary word in the dictionary, include its synonyms or variant forms. This ensures a comprehensive understanding of terms and their different usages.
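
A small sketch of how variants can be recorded alongside a primary term so that any surface form resolves to the same entry; the words chosen here are arbitrary examples.

    # Map each primary term to its synonyms/variants (arbitrary examples).
    synonyms = {
        "automobile": ["car", "auto", "motorcar"],
        "purchase":   ["buy", "acquire"],
    }
    # Invert the map so every variant points back to its primary term.
    variant_to_primary = {v: primary for primary, variants in synonyms.items() for v in variants}
    print(variant_to_primary.get("auto"))  # -> 'automobile'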

Handling Out-of-Vocabulary (OOV) Words:

Implement strategies to manage words not present in the dictionary. Techniques like subword tokenization can help the LLM process unfamiliar words by breaking them down into known subwords or tokens.
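
A simplified sketch of the subword idea, in the spirit of WordPiece-style greedy longest-match splitting (not the exact algorithm of any particular model): break an out-of-vocabulary word into pieces that are in the dictionary, falling back to single characters.

    def greedy_subword_split(word, subword_vocab):
        """Split an out-of-vocabulary word into known subwords, longest match
        first. A simplified sketch, not any model's exact tokenizer."""
        pieces, i = [], 0
        while i < len(word):
            for j in range(len(word), i, -1):  # try the longest candidate first
                piece = word[i:j]
                if piece in subword_vocab:
                    pieces.append(piece)
                    i = j
                    break
            else:  # no known piece starting here: fall back to one character
                pieces.append(word[i])
                i += 1
        return pieces

    # Hypothetical subword vocabulary:
    vocab = {"re", "verb", "ho", "dl"}
    print(greedy_subword_split("reverb", vocab))  # ['re', 'verb']
    print(greedy_subword_split("hodl", vocab))    # ['ho', 'dl']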

Feedback Loops with Users:

Allow users to flag unfamiliar or misunderstood terms. Incorporate this feedback to expand and refine the dictionary, ensuring that it stays updated with evolving language use.

Cross-Referencing with Established Dictionaries:

Compare the LLM's dictionary with established linguistic dictionaries or domain-specific lexicons. Fill in gaps or validate entries to align with standard language definitions and usages.
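
A minimal sketch of this comparison using plain set operations; both word lists below are hypothetical placeholders.

    # Hypothetical word lists for illustration.
    model_dictionary  = {"loan", "hodl", "reverb", "modle"}   # note the typo 'modle'
    reference_lexicon = {"loan", "reverb", "model", "crux"}

    possible_noise = model_dictionary - reference_lexicon     # entries to review
    coverage_gaps  = reference_lexicon - model_dictionary     # candidates to add
    print(possible_noise)  # {'hodl', 'modle'}
    print(coverage_gaps)   # {'model', 'crux'}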

Conclusion

Large Language Models (LLMs) are powerful, but they have limits. They know a lot of words, but not all. This means they might not understand everything, especially special terms used in certain jobs or hobbies.

For people in business, this is very important. They need to keep teaching these systems new words so they can understand and talk about different subjects better. It's like a balancing act — having enough words but making sure the system still works fast.

Looking forward, we want these language systems to get even better at talking and understanding. It's more than just teaching them words; it's about helping them communicate clearly. The real success will be when these systems can talk with people easily, almost like talking to a human.
