Deciphering the Mythology of Language Models: Word Embeddings

By Angel Salazar, PhD


Let's start with ChatGPT, a large language model (LLM) initially based on the GPT-3.5 architecture and trained on text containing hundreds of billions of words, with billions of parameters in its neural network.

Imagine the TV programme "The Chase," where the expert knows a great deal because he/she has read all the Guinness World Records books and encyclopaedias. He/she appears to know everything, but in the case of our artificially intelligent counterpart it is simply memorised knowledge, not reasoning; it's merely about estimating the next word, even the next sentence, from millions of occurrences of the same text pattern, by brute force.
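To make that "estimating the next word" idea concrete, here is a minimal sketch using simple bigram counting. This is not what GPT does internally (GPT uses a neural network, not a lookup table), but it captures the intuition of predicting the most likely continuation from observed patterns; the tiny corpus is invented for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus, standing in for the billions of words a real LLM sees.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each other word (a bigram table).
follows = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    follows[current_word][next_word] += 1

def predict_next(word):
    """Return the word most frequently seen after `word` in the corpus."""
    candidates = follows[word]
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_next("the"))  # 'cat' -- it follows 'the' most often here
```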

The first version of ChatGPT could ingest roughly 2,000 words per query, later increased to about 4,000; strictly speaking, these limits are measured in tokens (word fragments) rather than whole words. LLMs like Claude can handle up to 200,000 tokens per question, roughly 150,000 words, so they can already process an entire book in one go.
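Since the limits are measured in tokens, a text's token count usually differs from its word count. A short sketch with OpenAI's open-source tiktoken tokeniser (pip install tiktoken) shows the difference; cl100k_base is the encoding used by GPT-3.5 and GPT-4 era models.

```python
import tiktoken

# Load the tokeniser used by GPT-3.5/GPT-4 era models.
enc = tiktoken.get_encoding("cl100k_base")

text = "Word embeddings turn words into vectors of numbers."
tokens = enc.encode(text)

print(len(text.split()), "words ->", len(tokens), "tokens")
print(tokens[:5])  # each token is an integer ID in the model's vocabulary
```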

The ongoing challenge is how we interact effectively with the LLM, that is, how we ask the right questions in the right format. Asking questions this way, known as "prompting," has become a science, or rather a pseudo-science. To interact effectively with an LLM, we should ask in clear, standard English, since OpenAI's training data draws heavily on sources such as Wikipedia. If you phrase the question in an unusual structure, the LLM may get confused.
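As an illustration, here is one common way to give a prompt a clear structure. The Role/Context/Task/Format labels are just a popular convention, not an official requirement of any model.

```python
# A hypothetical structured prompt. The section labels are one common
# convention for keeping instructions unambiguous, not an OpenAI standard.
prompt = """Role: You are an experienced science editor.
Context: The text below is a draft blog post about word embeddings.
Task: Summarise it in three bullet points for a general audience.
Format: Plain English, no jargon.

Text: {article_text}"""

print(prompt.format(article_text="Word embeddings map words to vectors..."))
```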

LLMs generate text by comparing the question you ask with the text in the LLM's "memory." Specifically, the LLM compares the semantic structure of the question against the stored information that has previously been vectorised.
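A simplified sketch of that comparison step: embedding vectors are typically compared with cosine similarity, which measures how closely two vectors point in the same direction. The three-dimensional vectors below are invented for illustration; real embeddings have thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """1.0 means identical direction; values near 0 mean unrelated."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

question  = np.array([0.9, 0.1, 0.3])   # pretend embedding of the question
passage_a = np.array([0.8, 0.2, 0.25])  # pretend embedding of stored text A
passage_b = np.array([0.1, 0.9, 0.7])   # pretend embedding of stored text B

print(cosine_similarity(question, passage_a))  # ~0.99: semantically close
print(cosine_similarity(question, passage_b))  # ~0.36: much less related
```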

Semantic Structure based on Vectors:

When we talk about semantic structure, we refer to the vectors used by LLMs like ChatGPT. These vectors do not have one, two (2D), three (3D), or four dimensions; they have thousands, or even tens of thousands. Models based on the GPT-3 architecture use 12,288 dimensions, and GPT-4 reportedly uses 16,384 (OpenAI has not published the details). In a simplified scenario, let's vectorise a word in a low-dimensional space, with only five dimensions, to illustrate how word embeddings work.

Illustrative Example of Word Embeddings:

Let's say we want to create a 5-dimensional vector representation for the word "tree." We could consider aspects such as semantic and grammatical categories and usage frequency. Each dimension in our vector could represent:

Semantic Category: Assigning numbers to different categories (e.g., 0 for objects, 1 for living beings, 2 for places). Since "tree" is a living being, it could be 1.

Commonality: Classifying from 0 to 1, where 1 represents very common words and 0 represents rare words. "Tree" is quite common, so we could assign a value like 0.7.

Word Length: Simply the number of characters. "Tree" has 4 characters.

Part of Speech: Assigning numbers to represent different grammatical categories (e.g., 0 for nouns, 1 for verbs, etc.). "Tree" is a noun, so it could be 0.

Sentiment: If we associate numbers with sentiment (e.g., -1 for negative, 0 for neutral, 1 for positive), "tree" might be considered neutral, so it could be 0.

Thus, the 5-dimensional vector for "tree" might look something like this: [1, 0.7, 4, 0, 0].

The vector [1, 0.7, 4, 0, 0] numerically represents the word "tree," indicating that it is a common living being (1, 0.7), has four letters, is a noun, and has a neutral sentiment (4, 0, 0).
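As a quick sketch, the same hand-crafted vector can be built in a few lines of Python. The function and its inputs mirror the illustrative dimensions above and are purely hypothetical, not how real embeddings are produced.

```python
def toy_embedding(word, semantic_category, commonality, part_of_speech, sentiment):
    """Build a 5-dim vector: [category, commonality, length, POS, sentiment]."""
    return [semantic_category, commonality, len(word), part_of_speech, sentiment]

# "tree": living being (1), common (0.7), 4 letters, noun (0), neutral (0)
print(toy_embedding("tree", 1, 0.7, 0, 0))  # -> [1, 0.7, 4, 0, 0]
```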

This is a very simplified example, and in real applications, vector dimensions (i.e., embeddings) are derived from complex models that capture a broader context and relationships between words.
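For a taste of how such models work in practice, here is a minimal sketch that learns embeddings from a toy corpus with the word2vec algorithm, via the gensim library (pip install gensim). The three-sentence corpus and the 10-dimensional vector size are illustrative; real models train on billions of words with hundreds or thousands of dimensions.

```python
from gensim.models import Word2Vec

# A tiny made-up corpus; real training data would be vastly larger.
sentences = [
    ["the", "tree", "grows", "in", "the", "forest"],
    ["the", "oak", "is", "a", "tall", "tree"],
    ["birds", "nest", "in", "the", "oak", "tree"],
]

# Learn 10-dimensional embeddings from word co-occurrence patterns.
model = Word2Vec(sentences=sentences, vector_size=10, window=3, min_count=1, seed=42)

print(model.wv["tree"])                # a learned 10-dimensional vector
print(model.wv.most_similar("tree"))   # words whose vectors point the same way
```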

I hope this gives you an idea of how different aspects of a word could be quantified into numerical values in a natural language vector (i.e., embedding).



#LanguageModels #AI #ChatGPT #NeuralNetworks #MachineLearning #DeepLearning #TechInnovation #DataScience #ArtificialIntelligence #GPT3 #GPT4 #NLP #NaturalLanguageProcessing #Technology #TechTrends #BigData #AIResearch #SemanticAnalysis #VectorEmbeddings #WordEmbeddings #TextGeneration #AIWriting #OpenAI #TechCommunity #SmartTechnology #TechUpdates #FutureOfAI #AIInsights #TechDevelopment #AIEducation #DigitalTransformation #TechSavvy #AIPower #AIandEthics #AIForGood #ResponsibleAI #AITechnology #MachineIntelligence #CognitiveComputing #AIApplications #AILearning #AIDisruption #AIFuture #TechImpact #AITools #AITrends #AIAnalysis #AIStrategy #TechProgress #AIAdvancement #DeepTech #EmergingTech #InnovativeTech #TechRevolution #AIRevolution #TechExploration #AIThinking


Ben Holt

CEO, Senshine Ltd

5 months ago

Nice - love this accessible explanation of how natural language processing translates words into vectors.
