Deciphering the Mythology of Language Models: Word Embeddings

By Angel Salazar, PhD


Let's start with ChatGPT, a large language model (LLM) initially based on the GPT-3.5 architecture and trained on text containing hundreds of billions of words, with billions of parameters in its neural network.

Imagine the TV programme "The Chase," where the expert knows a great deal because he/she has read all the Guinness World Records books and encyclopaedias. He/she appears to know everything, but in the case of our artificially intelligent counterpart it is simply memorised knowledge, not reasoning; it's merely about estimating the next word, even the next sentence, from millions of occurrences of the same text pattern, by brute force.
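To make that "estimating the next word" idea concrete, here is a minimal sketch using simple bigram counting. This is not what GPT does internally (GPT uses a neural network, not a lookup table), but it captures the intuition of predicting the most likely continuation from observed patterns; the tiny corpus is invented for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus, standing in for the billions of words a real LLM sees.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each other word (a bigram table).
follows = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    follows[current_word][next_word] += 1

def predict_next(word):
    """Return the word most frequently seen after `word` in the corpus."""
    candidates = follows[word]
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_next("the"))  # 'cat' -- it follows 'the' most often here
```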

The first version of ChatGPT could ingest roughly 2,000 words per query, later increased to about 4,000; strictly speaking, these limits are measured in tokens (word fragments) rather than whole words. LLMs like Claude can handle up to 200,000 tokens per question, roughly 150,000 words, so they can already process an entire book in one go.
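Since the limits are measured in tokens, a text's token count usually differs from its word count. A short sketch with OpenAI's open-source tiktoken tokeniser (pip install tiktoken) shows the difference; cl100k_base is the encoding used by GPT-3.5 and GPT-4 era models.

```python
import tiktoken

# Load the tokeniser used by GPT-3.5/GPT-4 era models.
enc = tiktoken.get_encoding("cl100k_base")

text = "Word embeddings turn words into vectors of numbers."
tokens = enc.encode(text)

print(len(text.split()), "words ->", len(tokens), "tokens")
print(tokens[:5])  # each token is an integer ID in the model's vocabulary
```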

The ongoing challenge is how we interact effectively with the LLM, that is, how we ask the right questions in the right format. Asking questions this way, known as "prompting," has become a science, or rather a pseudo-science. To interact effectively with an LLM, we should ask in clear, standard English, since OpenAI's training data draws heavily on sources such as Wikipedia. If you phrase the question in an unusual structure, the LLM may get confused.
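As an illustration, here is one common way to give a prompt a clear structure. The Role/Context/Task/Format labels are just a popular convention, not an official requirement of any model.

```python
# A hypothetical structured prompt. The section labels are one common
# convention for keeping instructions unambiguous, not an OpenAI standard.
prompt = """Role: You are an experienced science editor.
Context: The text below is a draft blog post about word embeddings.
Task: Summarise it in three bullet points for a general audience.
Format: Plain English, no jargon.

Text: {article_text}"""

print(prompt.format(article_text="Word embeddings map words to vectors..."))
```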

LLMs generate text by comparing the question you ask with the text in the LLM's "memory." Specifically, the LLM compares the semantic structure of the question against the stored information that has previously been vectorised.
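A simplified sketch of that comparison step: embedding vectors are typically compared with cosine similarity, which measures how closely two vectors point in the same direction. The three-dimensional vectors below are invented for illustration; real embeddings have thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """1.0 means identical direction; values near 0 mean unrelated."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

question  = np.array([0.9, 0.1, 0.3])   # pretend embedding of the question
passage_a = np.array([0.8, 0.2, 0.25])  # pretend embedding of stored text A
passage_b = np.array([0.1, 0.9, 0.7])   # pretend embedding of stored text B

print(cosine_similarity(question, passage_a))  # ~0.99: semantically close
print(cosine_similarity(question, passage_b))  # ~0.36: much less related
```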

Semantic Structure based on Vectors:

When we talk about semantic structure, we refer to the vectors used by LLMs like ChatGPT. These vectors do not have one, two (2D), three (3D), or four dimensions; they have thousands, or even tens of thousands. Models based on the GPT-3 architecture use 12,288 dimensions, and GPT-4 reportedly uses 16,384 (OpenAI has not published the details). In a simplified scenario, let's vectorise a word in a low-dimensional space, with only five dimensions, to illustrate how word embeddings work.

Illustrative Example of Word Embeddings:

Let's say we want to create a 5-dimensional vector representation for the word "tree." We could consider aspects such as semantic and grammatical categories and usage frequency. Each dimension in our vector could represent:

Semantic Category: Assigning numbers to different categories (e.g., 0 for objects, 1 for living beings, 2 for places). Since "tree" is a living being, it could be 1.

Commonality: Classifying from 0 to 1, where 1 represents very common words and 0 represents rare words. "Tree" is quite common, so we could assign a value like 0.7.

Word Length: Simply the number of characters. "Tree" has 4 characters.

Part of Speech: Assigning numbers to represent different grammatical categories (e.g., 0 for nouns, 1 for verbs, etc.). "Tree" is a noun, so it could be 0.

Sentiment: If we associate numbers with sentiment (e.g., -1 for negative, 0 for neutral, 1 for positive), "tree" might be considered neutral, so it could be 0.

Thus, the 5-dimensional vector for "tree" might look something like this: [1, 0.7, 4, 0, 0].

The vector [1, 0.7, 4, 0, 0] numerically represents the word "tree," indicating that it is a common living being (1, 0.7), has four letters, is a noun, and has a neutral sentiment (4, 0, 0).
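As a quick sketch, the same hand-crafted vector can be built in a few lines of Python. The function and its inputs mirror the illustrative dimensions above and are purely hypothetical, not how real embeddings are produced.

```python
def toy_embedding(word, semantic_category, commonality, part_of_speech, sentiment):
    """Build a 5-dim vector: [category, commonality, length, POS, sentiment]."""
    return [semantic_category, commonality, len(word), part_of_speech, sentiment]

# "tree": living being (1), common (0.7), 4 letters, noun (0), neutral (0)
print(toy_embedding("tree", 1, 0.7, 0, 0))  # -> [1, 0.7, 4, 0, 0]
```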

This is a very simplified example, and in real applications, vector dimensions (i.e., embeddings) are derived from complex models that capture a broader context and relationships between words.
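For a taste of how such models work in practice, here is a minimal sketch that learns embeddings from a toy corpus with the word2vec algorithm, via the gensim library (pip install gensim). The three-sentence corpus and the 10-dimensional vector size are illustrative; real models train on billions of words with hundreds or thousands of dimensions.

```python
from gensim.models import Word2Vec

# A tiny made-up corpus; real training data would be vastly larger.
sentences = [
    ["the", "tree", "grows", "in", "the", "forest"],
    ["the", "oak", "is", "a", "tall", "tree"],
    ["birds", "nest", "in", "the", "oak", "tree"],
]

# Learn 10-dimensional embeddings from word co-occurrence patterns.
model = Word2Vec(sentences=sentences, vector_size=10, window=3, min_count=1, seed=42)

print(model.wv["tree"])                # a learned 10-dimensional vector
print(model.wv.most_similar("tree"))   # words whose vectors point the same way
```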

I hope this gives you an idea of how different aspects of a word could be quantified into numerical values in a natural language vector (i.e., embedding).



#LanguageModels #AI #ChatGPT #NeuralNetworks #MachineLearning #DeepLearning #TechInnovation #DataScience #ArtificialIntelligence #GPT3 #GPT4 #NLP #NaturalLanguageProcessing #Technology #TechTrends #BigData #AIResearch #SemanticAnalysis #VectorEmbeddings #WordEmbeddings #TextGeneration #AIWriting #OpenAI #TechCommunity #SmartTechnology #TechUpdates #FutureOfAI #AIInsights #TechDevelopment #AIEducation #DigitalTransformation #TechSavvy #AIPower #AIandEthics #AIForGood #ResponsibleAI #AITechnology #MachineIntelligence #CognitiveComputing #AIApplications #AILearning #AIDisruption #AIFuture #TechImpact #AITools #AITrends #AIAnalysis #AIStrategy #TechProgress #AIAdvancement #DeepTech #EmergingTech #InnovativeTech #TechRevolution #AIRevolution #TechExploration #AIThinking


Ben Holt

CEO, Senshine Ltd

5 months ago

Nice - love this accessible explanation of how natural language processing translates words into vectors.
