Math Behind Large Language Models Explained

Have you ever chatted with an AI like ChatGPT or DeepSeek and wondered how it seems to "understand" you? It can write stories, answer questions, or even sound like a friend—but there’s no magic here. It’s all math! At the heart of these Large Language Models (LLMs) is a system called the Transformer, powered by simple ideas like vectors, dot products, and a clever trick called "attention."

Let’s break it down step-by-step so you can see how words turn into numbers and back into words again—even if you’re new to computer science or just know some high school math.

Words as Numbers: Vectors

LLMs don’t read words like we do. Instead, they turn every word into a vector—a list of numbers. Think of it like a secret code. For example:

  • "Cat" might become [0.2, -0.1, 0.5, ...] with 768 numbers.
  • "Dog" might be [0.3, 0.0, 0.4, ...].

These aren’t random numbers! They’re carefully crafted so similar words (like "cat" and "dog") have vectors that are close together, while different ones (like "cat" and "car") are farther apart. This process, called embedding, is trained on billions of sentences to capture meaning—like a map of words in a 768-dimensional world. For now, just picture vectors as "number versions" of words.
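If you want to see this "closeness" idea in code, here is a tiny sketch using made-up 4-number vectors (real embeddings have hundreds of dimensions and are learned from data, not hand-picked like these):

```python
import numpy as np

# Toy 4-dimensional "embeddings" (invented for illustration;
# real models learn 768+ numbers per word).
cat = np.array([0.2, -0.1, 0.5, 0.3])
dog = np.array([0.3, 0.0, 0.4, 0.3])
car = np.array([-0.4, 0.6, -0.2, 0.1])

def cosine_similarity(u, v):
    """How aligned two vectors are: 1 = same direction, -1 = opposite."""
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(cat, dog))  # high: "cat" and "dog" point the same way
print(cosine_similarity(cat, car))  # much lower: "cat" and "car" point apart
```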

The Transformer: The Brain of LLMs

The Transformer is the math engine driving LLMs. It’s like a two-part recipe:

Attention: Figuring out which words matter most to each other.

Prediction: Guessing the next word based on that.

Attention is the star of the show, so let’s zoom in there. Imagine the sentence "The cat chased the mouse." When the model looks at "chased," it needs to decide: should it focus on "the," "cat," or "mouse"? Attention gives each word a score to show how important it is, making the model smarter about context.

Attention: Connecting the Dots

Attention is how LLMs link words together. It uses three special vectors for each word:

  • Query (Q): What the word is "asking" (e.g., “What does ‘chased’ relate to?”).
  • Key (K): What the word "offers" (e.g., “I’m ‘cat’—here’s my info”).
  • Value (V): The actual information to share if relevant (e.g., “Here’s what ‘cat’ means”).

These vectors start as the word’s embedding but get tweaked with some multiplication (using learned numbers called weights). Let’s see how it works with math you might recognize.
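Here is a minimal sketch of that tweaking step, assuming tiny 2x2 weight matrices filled with random numbers (in a trained model these weights are learned, not random):

```python
import numpy as np

# A word's embedding (just 2 numbers here, to match the worked example below).
embedding = np.array([0.5, 0.8])

# Three learned weight matrices. Random stand-ins here; training on
# billions of sentences is what gives them their real values.
rng = np.random.default_rng(0)
W_q = rng.normal(size=(2, 2))
W_k = rng.normal(size=(2, 2))
W_v = rng.normal(size=(2, 2))

# Each role is just the embedding multiplied by a different weight matrix.
query = embedding @ W_q
key   = embedding @ W_k
value = embedding @ W_v
print(query, key, value)
```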

Step 1: Dot Product for Similarity

To figure out how much "chased" cares about "cat," we use the dot product. You might remember this from math class: multiply pairs of numbers and add them up. For two vectors [a, b] and [c, d]:

                      Dot product = (a × c) + (b × d)        

  • Query for "chased": [0.1, 0.2]
  • Key for "cat": [0.3, 0.4]
  • Dot product: 0.1 × 0.3 + 0.2 × 0.4 = 0.03 + 0.08 = 0.11

A bigger dot product means "chased" and "cat" are more connected. The model does this for every pair—like "chased" with "the" or "mouse"—to get a list of scores.
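That pairwise scoring looks like this in Python (the Key vectors for "the" and "mouse" are made up just for illustration):

```python
import numpy as np

# Query for "chased", and Keys for the words it might attend to.
q_chased = np.array([0.1, 0.2])
keys = {
    "cat":   np.array([0.3, 0.4]),
    "the":   np.array([0.15, 0.2]),   # invented numbers for illustration
    "mouse": np.array([0.05, 0.1]),
}

# One dot product per pair gives "chased" a score for every word.
scores = {word: q_chased.dot(k) for word, k in keys.items()}
print(scores)  # "cat" scores highest: 0.1*0.3 + 0.2*0.4 = 0.11
```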

Step 2: Scaling Down

In real LLMs, vectors have 768 numbers, so dot products can get huge (think 50 or 100!). Big numbers push the next step (softmax) toward extreme, all-or-nothing weights, so we shrink them by dividing by the square root of the vector size. If the size is 2:

Scaled score = 0.11 / √2 ≈ 0.078        

Step 3: Softmax—Turning Scores into Probabilities

Next, we turn these scores into "weights" that add up to 1, like probabilities. This is done with softmax, a math trick that exaggerates larger values and suppresses smaller ones:

  1. Start with the scaled scores, say [0.078, 0.05, 0.02] for "cat," "the," and "mouse."
  2. Raise e (about 2.718) to each: [e^0.078, e^0.05, e^0.02] ≈ [1.081, 1.051, 1.020].
  3. Sum them: 1.081 + 1.051 + 1.020 = 3.152.
  4. Normalize (divide each by the sum): [0.343, 0.333, 0.324].

Now, "chased" gives 34.3% attention to "cat," 33.3% to "the," and 32.4% to "mouse."

Step 4: Mixing the Values

Each word has a Value vector (V). Multiply each by its weight and add them:

  • “Cat” V: [0.2, 0.3] × weight 0.343 → [0.069, 0.103]
  • “The” V: [0.1, 0.1] × weight 0.333 → [0.033, 0.033]
  • “Mouse” V: [0.4, 0.5] × weight 0.324 → [0.130, 0.162]
  • Sum: [0.069 + 0.033 + 0.130, 0.103 + 0.033 + 0.162] = [0.232, 0.298]

This new vector [0.232, 0.298] becomes the updated vector for “chased” with context!
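In code, all that multiplying and adding collapses into a single matrix product:

```python
import numpy as np

# Attention weights from the softmax step, and each word's Value vector.
weights = np.array([0.343, 0.333, 0.324])   # cat, the, mouse
values  = np.array([[0.2, 0.3],             # "cat" V
                    [0.1, 0.1],             # "the" V
                    [0.4, 0.5]])            # "mouse" V

# Weighted sum: one matrix-vector product does every multiply-and-add at once.
updated_chased = weights @ values
print(updated_chased)  # close to [0.232, 0.298], matching the hand calculation
```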

Putting It Together

The Transformer applies this process to every word, generating an attention matrix (e.g., 5×5 for a sentence with 5 words). Then:

  • Layers: Repeating the process in multiple layers refines the vectors.
  • Multi-Head Attention: Splitting vectors into heads to capture different relationships (e.g., grammar vs. meaning).
  • Prediction: The final vectors predict the next word probabilistically.
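Putting every step together, one pass of scaled dot-product attention is only a few lines of NumPy. The vectors below are random stand-ins for what a trained model would produce, so only the shapes are meaningful here:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                               # dot products, scaled
    exps = np.exp(scores - scores.max(axis=-1, keepdims=True))  # softmax (stable form)
    weights = exps / exps.sum(axis=-1, keepdims=True)
    return weights @ V                                          # mix the Values

# 5 words, 4 numbers each: random stand-ins for trained Q, K, V vectors.
rng = np.random.default_rng(42)
Q = rng.normal(size=(5, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 4))
print(attention(Q, K, V).shape)  # (5, 4): one updated vector per word
```

Subtracting the row maximum before exponentiating doesn't change the softmax result, but it keeps e^x from overflowing when scores are large, which is why real implementations do it.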

Why This Math Works

  • Dot Product: Spots how similar words are—like "chased" and "cat" lining up.
  • Softmax: Balances attention so the model focuses just right.
  • Vectors: Pack tons of meaning into numbers that math can tweak.

It’s all basic algebra: multiplication, addition, and a bit of the exponential function e^x. No fancy calculus—just lots of number-crunching!

Real-World Magic

When processing “The cat chased the mouse,” the Transformer learns to connect “chased” to “cat” and “mouse,” downplaying “the.” After training on billions of sentences, this math enables LLMs to write essays, solve problems, or chat with you naturally—all from vectors dancing together.

Next time you use an LLM, think: behind every word is a vector, and behind every answer is a dance of numbers. Pretty cool with only some high school math?

Try It Yourself

Here’s a tiny taste in Python with NumPy:

import numpy as np
a = np.array([1, 2])
b = np.array([3, 4])
print(a.dot(b))  # 11        

This is the start of attention—multiply and add!

Now you’ve peeked under the hood—math isn’t just for textbooks; it’s the language of AI!

Note: This article simplifies details like positional encodings (which help track word order) and feed-forward layers. For more, check out the original paper Attention Is All You Need or Andrej Karpathy’s YouTube tutorials.
