Real-World AI: Why LLMs Might Be Lost for Words
Introduction
When companies use Large Language Models (LLMs), they hope these systems can understand and use every word from the information they were taught. But here's the issue: LLMs can't remember every word. They have a word list limit.
What does this mean? Well, imagine you're talking about special topics like Bitcoin, making music, or rock climbing. Words like "hodl," "reverb," or "crux" are important in these areas. But if these words are not in the LLM's word list, the system might just skip over them or get confused.
So, businesses need to pick the right words for their LLMs carefully. This way, when the LLM talks about these topics, it doesn't miss the important words and makes sense to the people it's communicating with.
Vocabulary vs. Dictionary Size in LLMs
Vocabulary
The vocabulary is the full set of unique words that appear in the training data: every term the model has ever encountered, however rarely.
Dictionary
The dictionary is the fixed list of tokens the model actually works with after tokenization and frequency filtering. It determines which words the model can represent directly, rather than as fragments.
Training Datasets: Vocabulary and Dictionary Sizes
The training datasets for prominent Large Language Models (LLMs) are colossal, often encompassing wide corpora of the internet's text. These datasets need to be extensive to capture the diversity of human language. However, the actual vocabulary size — the number of unique words the model is trained on — can vary based on several factors, including the language, the breadth of the training data, and the tokenization method used.
Raw Data: Initially, LLMs like GPT-3 were trained on datasets drawn from a broad range of sources, including books, websites, and other texts, amounting to hundreds of gigabytes or even terabytes of raw text. The number of unique words can run into the millions, especially with multi-language data. The next step helps manage this huge number of words.
Tokenization: The raw text is then tokenized, that is, split into pieces, often whole words or parts of words. The resulting pieces depend heavily on the chosen tokenization technique.
Dictionary Construction: In one common approach, unique tokens are identified and a frequency count is conducted. Rare words (those appearing fewer than a certain number of times in the entire dataset) may be excluded from the final dictionary. There are other approaches to building a dictionary that fits the business task without bloating the resources required to handle it; a short sketch of the frequency-based approach follows below.
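To make these three steps concrete, here is a minimal Python sketch, using a tiny made-up corpus and an arbitrary frequency cutoff (real pipelines operate on terabytes of text and use subword tokenizers rather than plain word splitting):

```python
from collections import Counter
import re

# Hypothetical raw corpus; real training sets run to terabytes of text.
raw_text = """
Traders shouted hodl while the guitarist added reverb,
and the climbers argued about the crux of the route.
The market, the music, and the mountain all use their own words.
"""

# Steps 1 and 2: a very simple word-level tokenization.
tokens = re.findall(r"[a-z']+", raw_text.lower())

# Step 3: count token frequencies and keep only tokens that appear at
# least MIN_COUNT times; everything rarer is left out of the dictionary.
MIN_COUNT = 2
frequencies = Counter(tokens)
dictionary = {tok for tok, n in frequencies.items() if n >= MIN_COUNT}

print(f"unique words in corpus (vocabulary): {len(frequencies)}")
print(f"words kept in dictionary:            {len(dictionary)}")
print("dropped as too rare:", sorted(set(frequencies) - dictionary)[:10])
```

With a realistic corpus and cutoff, the rare domain terms from the introduction ("hodl", "reverb", "crux") are exactly the kind of words that fall below the threshold.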
Dictionary sizes vary considerably from one LLM to another, reflecting the differing strategies and goals of each model. A larger dictionary allows for more nuanced understanding and text generation, but it also requires more computational power to manage effectively. Each model represents a different point of balance between these factors, tailored to the specific use cases it was designed to address.
Vocabulary Reconstruction:
Most of the frequent words in the training-dataset vocabulary can be reconstructed from the dictionary, but not all of them: some pieces of infrequent words may be missing from the dictionary entirely. That can lead to limited training on, and limited 'understanding' of, unusual words.
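One rough way to see this effect is to check which words can be re-assembled from the pieces that made it into the dictionary. The sketch below uses a toy dictionary and a greedy longest-match splitter (real tokenizers are more sophisticated); the common words come back cleanly, while a niche term like "hodl" cannot be covered at all:

```python
# Toy dictionary of whole words and subword pieces that made the frequency cut.
dictionary = {"climb", "ing", "re", "verb", "ho", "market", "trade", "r", "s"}

def split_with_dictionary(word: str) -> list[str] | None:
    """Greedily cover `word` with the longest matching dictionary pieces.

    Returns the list of pieces, or None if some part of the word is not
    covered by any dictionary entry at all.
    """
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest match first
            if word[i:j] in dictionary:
                pieces.append(word[i:j])
                i = j
                break
        else:
            return None                     # uncovered fragment
    return pieces

for word in ["climbing", "traders", "reverb", "hodl"]:
    print(word, "->", split_with_dictionary(word))
```

A production byte-pair tokenizer would fall back to characters or bytes rather than fail outright, but the effect is similar: the niche term is handled as a bag of fragments rather than as one meaningful unit.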
Practicality:
Vocabulary/Dictionary Balancing Act
In practice, maintaining a balance between an expansive vocabulary and a comprehensive dictionary is crucial for the efficacy of LLMs. Too much emphasis on vocabulary breadth without deepening the dictionary can lead to shallow performance, while an overly specialized dictionary might limit the model's versatility.
When we talk about the limitations related to the vocabulary of LLMs, these often directly influence the dictionary size. For instance, if domain terms like "hodl" or "crux" appear only rarely in the training data, they may never earn dictionary entries of their own.
Dictionary Management
Collecting and maintaining dictionaries for LLMs from a dataset's entire vocabulary is crucial to ensure a deep, contextual understanding of words. By combining multiple approaches and continuously refining the dictionary based on real-world feedback and performance, LLM developers can ensure a robust and contextually rich understanding of language, enhancing the model's overall effectiveness.
Here are several approaches to curating and refining dictionaries for LLMs:
Frequency-Based Selection:
Identify words and phrases based on their occurrence frequency in the dataset. Commonly used words are essential for the dictionary, while rare words might be considered less critical. This approach ensures that the LLM recognizes and understands words that users are most likely to use.
Semantic Clustering:
Group words based on their semantic similarities using techniques like word embeddings. This helps in capturing not just individual words but also their interrelationships, ensuring that words with similar meanings or contexts are included.
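As a rough illustration, word embeddings can be clustered so that related terms are reviewed and added to the dictionary as a group. The sketch below runs scikit-learn's KMeans on made-up 2-D vectors standing in for real embeddings, which would normally come from the model itself or an embedding library:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D "embeddings"; real word vectors have hundreds of dimensions.
words = ["bitcoin", "hodl", "wallet", "reverb", "chord", "tempo"]
vectors = np.array([
    [0.90, 0.10], [0.80, 0.20], [0.85, 0.15],   # finance-flavoured region
    [0.10, 0.90], [0.20, 0.80], [0.15, 0.85],   # music-flavoured region
])

# Group the words into two semantic clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

clusters: dict[int, list[str]] = {}
for word, label in zip(words, labels):
    clusters.setdefault(int(label), []).append(word)

for label, members in clusters.items():
    print(f"cluster {label}: {members}")
```

If a cluster is dominated by words already in the dictionary but contains a few that are missing, those missing words are strong candidates for inclusion.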
Domain-Specific Curation:
If the LLM is intended for a specific domain (e.g., medicine, finance), prioritize terms and phrases relevant to that field. Extracting terms from domain-specific glossaries, textbooks, or databases can be beneficial.
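In practice this can be as simple as merging a curated glossary into the dictionary and noting which domain terms were missing. A minimal sketch with a small, hypothetical finance glossary:

```python
# Dictionary built from frequency counts alone (see the earlier sketch).
dictionary = {"market", "price", "trade", "interest", "rate"}

# Terms harvested from a domain glossary; a hypothetical curated list here,
# normally extracted from textbooks, databases, or published glossaries.
finance_glossary = {"hodl", "stablecoin", "slippage", "liquidity", "rate"}

missing = finance_glossary - dictionary   # domain terms the dictionary lacks
dictionary |= finance_glossary            # prioritize them regardless of frequency

print("added from the glossary:", sorted(missing))
print("dictionary size is now:", len(dictionary))
```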
Hierarchical Sampling:
Divide the vocabulary into hierarchical categories or topics. Sample words from each category to ensure a balanced representation of all topics in the dictionary.
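A small sketch of the idea, assuming the vocabulary has already been bucketed into topics (the buckets here are hypothetical and could come from semantic clustering or document metadata):

```python
import random

# Hypothetical topic buckets covering several specialist areas.
topics = {
    "finance":  ["hodl", "stablecoin", "liquidity", "slippage", "margin"],
    "music":    ["reverb", "chord", "tempo", "arpeggio", "timbre"],
    "climbing": ["crux", "belay", "crimp", "dyno", "beta"],
}

random.seed(0)
PER_TOPIC = 3  # how many words each topic contributes to the dictionary

sampled = {
    topic: random.sample(words, k=min(PER_TOPIC, len(words)))
    for topic, words in topics.items()
}
dictionary_additions = [w for words in sampled.values() for w in words]

print(sampled)
print("balanced additions:", dictionary_additions)
```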
Expert Review:
Engage domain experts to review and refine the dictionary, adding or removing terms based on real-world relevance and importance. This ensures that the dictionary reflects genuine expertise and covers terms that practitioners in the field deem essential.
Iterative Refinement:
As the LLM is used, continuously refine the dictionary based on performance feedback. Words or phrases that lead to misunderstandings or inaccuracies can be re-evaluated, and the dictionary can be updated accordingly.
Inclusion of Synonyms and Variants:
For every primary word in the dictionary, include its synonyms or variant forms. This ensures a comprehensive understanding of terms and their different usages.
Handling Out-of-Vocabulary (OOV) Words:
Implement strategies to manage words not present in the dictionary. Techniques like subword tokenization can help the LLM process unfamiliar words by breaking them down into known subwords or tokens.
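Modern byte-pair encodings work this way by design. As a quick check, here is a sketch using the open-source tiktoken package (assuming it is installed; the exact splits depend on the encoding chosen):

```python
import tiktoken

# A GPT-4-style byte-pair encoding; other encodings split words differently.
enc = tiktoken.get_encoding("cl100k_base")

for word in ["understanding", "hodl", "arpeggio"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{word!r} -> {len(token_ids)} token(s): {pieces}")
```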
Feedback Loops with Users:
Allow users to flag unfamiliar or misunderstood terms. Incorporate this feedback to expand and refine the dictionary, ensuring that it stays updated with evolving language use.
Cross-Referencing with Established Dictionaries:
Compare the LLM's dictionary with established linguistic dictionaries or domain-specific lexicons. Fill in gaps or validate entries to align with standard language definitions and usages.
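A sketch of the comparison step, with a hypothetical established lexicon represented as a plain set of terms:

```python
# The LLM's current dictionary (single-word entries only, for simplicity).
llm_dictionary = {"market", "liquidity", "reverb", "tempo", "belay"}

# A hypothetical established domain lexicon, e.g. parsed from a published glossary.
reference_lexicon = {"market", "liquidity", "slippage", "stablecoin", "belay", "crux"}

gaps = reference_lexicon - llm_dictionary     # standard terms the dictionary lacks
extras = llm_dictionary - reference_lexicon   # entries to validate against other sources

print("missing standard terms:", sorted(gaps))
print("entries needing validation:", sorted(extras))
```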
Conclusion
Large Language Models (LLMs) are powerful, but they have limits. They know a lot of words, but not all. This means they might not understand everything, especially special terms used in certain jobs or hobbies.
For people in business, this is very important. They need to keep teaching these systems new words so they can understand and talk about different subjects better. It's like a balancing act — having enough words but making sure the system still works fast.
Looking forward, we want these language systems to get even better at talking and understanding. It's more than just teaching them words; it's about helping them communicate clearly. The real success will be when these systems can talk with people easily, almost like talking to a human.