Math Behind Large Language Models Explained

Have you ever chatted with an AI like ChatGPT or DeepSeek and wondered how it seems to "understand" you? It can write stories, answer questions, or even sound like a friend—but there’s no magic here. It’s all math! At the heart of these Large Language Models (LLMs) is a system called the Transformer, powered by simple ideas like vectors, dot products, and a clever trick called "attention."

Let’s break it down step-by-step so you can see how words turn into numbers and back into words again—even if you’re new to computer science or just know some high school math.

Words as Numbers: Vectors

LLMs don’t read words like we do. Instead, they turn every word into a vector—a list of numbers. Think of it like a secret code. For example:

  • "Cat" might become [0.2, -0.1, 0.5, ...] with 768 numbers.
  • "Dog" might be [0.3, 0.0, 0.4, ...].

These aren’t random numbers! They’re carefully crafted so similar words (like "cat" and "dog") have vectors that are close together, while different ones (like "cat" and "car") are farther apart. This process, called embedding, is trained on billions of sentences to capture meaning—like a map of words in a 768-dimensional world. For now, just picture vectors as "number versions" of words.
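If you want to see this "closeness" idea in code, here is a tiny sketch using made-up 4-number vectors (real embeddings have hundreds of dimensions and are learned from data, not hand-picked like these):

```python
import numpy as np

# Toy 4-dimensional "embeddings" (invented for illustration;
# real models learn 768+ numbers per word).
cat = np.array([0.2, -0.1, 0.5, 0.3])
dog = np.array([0.3, 0.0, 0.4, 0.3])
car = np.array([-0.4, 0.6, -0.2, 0.1])

def cosine_similarity(u, v):
    """How aligned two vectors are: 1 = same direction, -1 = opposite."""
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(cat, dog))  # high: "cat" and "dog" point the same way
print(cosine_similarity(cat, car))  # much lower: "cat" and "car" point apart
```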

The Transformer: The Brain of LLMs

The Transformer is the math engine driving LLMs. It’s like a two-part recipe:

Attention: Figuring out which words matter most to each other.

Prediction: Guessing the next word based on that.

Attention is the star of the show, so let’s zoom in there. Imagine the sentence "The cat chased the mouse." When the model looks at "chased," it needs to decide: should it focus on "the," "cat," or "mouse"? Attention gives each word a score to show how important it is, making the model smarter about context.

Attention: Connecting the Dots

Attention is how LLMs link words together. It uses three special vectors for each word:

  • Query (Q): What the word is "asking" (e.g., “What does ‘chased’ relate to?”).
  • Key (K): What the word "offers" (e.g., “I’m ‘cat’—here’s my info”).
  • Value (V): The actual information to share if relevant (e.g., “Here’s what ‘cat’ means”).

These vectors start as the word’s embedding but get tweaked with some multiplication (using learned numbers called weights). Let’s see how it works with math you might recognize.
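Here is a minimal sketch of that tweaking step, assuming tiny 2x2 weight matrices filled with random numbers (in a trained model these weights are learned, not random):

```python
import numpy as np

# A word's embedding (just 2 numbers here, to match the worked example below).
embedding = np.array([0.5, 0.8])

# Three learned weight matrices. Random stand-ins here; training on
# billions of sentences is what gives them their real values.
rng = np.random.default_rng(0)
W_q = rng.normal(size=(2, 2))
W_k = rng.normal(size=(2, 2))
W_v = rng.normal(size=(2, 2))

# Each role is just the embedding multiplied by a different weight matrix.
query = embedding @ W_q
key   = embedding @ W_k
value = embedding @ W_v
print(query, key, value)
```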

Step 1: Dot Product for Similarity

To figure out how much "chased" cares about "cat," we use the dot product. You might remember this from math class: multiply pairs of numbers and add them up. For two vectors [a, b] and [c, d]:

                      Dot product = (a × c) + (b × d)        

  • Query for "chased": [0.1, 0.2]
  • Key for "cat": [0.3, 0.4]
  • Dot product: 0.1 × 0.3 + 0.2 × 0.4 = 0.03 + 0.08 = 0.11

A bigger dot product means "chased" and "cat" are more connected. The model does this for every pair—like "chased" with "the" or "mouse"—to get a list of scores.
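That pairwise scoring looks like this in Python (the Key vectors for "the" and "mouse" are made up just for illustration):

```python
import numpy as np

# Query for "chased", and Keys for the words it might attend to.
q_chased = np.array([0.1, 0.2])
keys = {
    "cat":   np.array([0.3, 0.4]),
    "the":   np.array([0.15, 0.2]),   # invented numbers for illustration
    "mouse": np.array([0.05, 0.1]),
}

# One dot product per pair gives "chased" a score for every word.
scores = {word: q_chased.dot(k) for word, k in keys.items()}
print(scores)  # "cat" scores highest: 0.1*0.3 + 0.2*0.4 = 0.11
```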

Step 2: Scaling Down

In real LLMs, vectors have 768 numbers, so dot products can get huge (think 50 or 100!). Big numbers push the next step (softmax) toward extreme, all-or-nothing weights, so we shrink them by dividing by the square root of the vector size. If the size is 2:

Scaled score = 0.11 / √2 ≈ 0.078        

Step 3: Softmax—Turning Scores into Probabilities

Next, we turn these scores into "weights" that add up to 1, like probabilities. This is done with softmax, a math trick that exaggerates larger values and suppresses smaller ones:

  1. Start with the scaled scores, say [0.078, 0.05, 0.02] for "cat," "the," and "mouse."
  2. Raise e (about 2.718) to each: [e^0.078, e^0.05, e^0.02] ≈ [1.081, 1.051, 1.020].
  3. Sum them: 1.081 + 1.051 + 1.020 = 3.152.
  4. Normalize (divide each by the sum): [0.343, 0.333, 0.324].

Now, "chased" gives 34.3% attention to "cat," 33.3% to "the," and 32.4% to "mouse."

Step 4: Mixing the Values

Each word has a Value vector (V). Multiply each by its weight and add them:

  • “Cat” V: [0.2, 0.3] × weight 0.343 → [0.069, 0.103]
  • “The” V: [0.1, 0.1] × weight 0.333 → [0.033, 0.033]
  • “Mouse” V: [0.4, 0.5] × weight 0.324 → [0.130, 0.162]
  • Sum: [0.069 + 0.033 + 0.130, 0.103 + 0.033 + 0.162] = [0.232, 0.298]

This new vector [0.232, 0.298] becomes the updated vector for “chased” with context!
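In code, all that multiplying and adding collapses into a single matrix product:

```python
import numpy as np

# Attention weights from the softmax step, and each word's Value vector.
weights = np.array([0.343, 0.333, 0.324])   # cat, the, mouse
values  = np.array([[0.2, 0.3],             # "cat" V
                    [0.1, 0.1],             # "the" V
                    [0.4, 0.5]])            # "mouse" V

# Weighted sum: one matrix-vector product does every multiply-and-add at once.
updated_chased = weights @ values
print(updated_chased)  # close to [0.232, 0.298], matching the hand calculation
```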

Putting It Together

The Transformer applies this process to every word, generating an attention matrix (e.g., 5×5 for a sentence with 5 words). Then:

  • Layers: Repeating the process in multiple layers refines the vectors.
  • Multi-Head Attention: Splitting vectors into heads to capture different relationships (e.g., grammar vs. meaning).
  • Prediction: The final vectors predict the next word probabilistically.
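Putting every step together, one pass of scaled dot-product attention is only a few lines of NumPy. The vectors below are random stand-ins for what a trained model would produce, so only the shapes are meaningful here:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                               # dot products, scaled
    exps = np.exp(scores - scores.max(axis=-1, keepdims=True))  # softmax (stable form)
    weights = exps / exps.sum(axis=-1, keepdims=True)
    return weights @ V                                          # mix the Values

# 5 words, 4 numbers each: random stand-ins for trained Q, K, V vectors.
rng = np.random.default_rng(42)
Q = rng.normal(size=(5, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 4))
print(attention(Q, K, V).shape)  # (5, 4): one updated vector per word
```

Subtracting the row maximum before exponentiating doesn't change the softmax result, but it keeps e^x from overflowing when scores are large, which is why real implementations do it.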

Why This Math Works

  • Dot Product: Spots how similar words are—like "chased" and "cat" lining up.
  • Softmax: Balances attention so the model focuses just right.
  • Vectors: Pack tons of meaning into numbers that math can tweak.

It’s all basic algebra: multiplication, addition, and a bit of the exponential function e^x. No fancy calculus—just lots of number-crunching!

Real-World Magic

When processing “The cat chased the mouse,” the Transformer learns to connect “chased” to “cat” and “mouse,” downplaying “the.” After training on billions of sentences, this math enables LLMs to write essays, solve problems, or chat with you naturally—all from vectors dancing together.

Next time you use an LLM, think: behind every word is a vector, and behind every answer is a dance of numbers. Pretty cool with only some high school math?

Try It Yourself

Here’s a tiny taste in Python with NumPy:

import numpy as np
a = np.array([1, 2])
b = np.array([3, 4])
print(a.dot(b))  # 11        

This is the start of attention—multiply and add!

Now you’ve peeked under the hood—math isn’t just for textbooks; it’s the language of AI!

Note: This article simplifies details like positional encodings (which help track word order) and feed-forward layers. For more, check out the original paper Attention Is All You Need or Andrej Karpathy’s YouTube tutorials.
