ChatGPT - How it really works
Vivek Kumar
The GPT in ChatGPT stands for "Generative Pre-Trained Transformer" and is a language model that has gained widespread popularity for its ability to generate human-like text.
The story of ChatGPT is not just the story of generative AI; it's also a story of the "random walk" of technology and the progress of science and philosophy. It is the product of the fortuitous convergence of the latest neural network technology and the availability of zettabytes of data on the internet, culminating in a burst of sudden progress. But how does it really work? What's going on inside ChatGPT's mind?
ChatGPT is based on the concept of neural nets, originally invented in the 1940s as an idealization of the operation of the human brain. Birds inspired aeroplanes and burdock plants inspired velcro, so it's only natural that brain cells (neurons) would be the inspiration for "intelligent machines".
What makes them so useful is that they can, in principle, do all sorts of tasks, and can be incrementally trained from example data to do those tasks. For example, when we want a neural net to distinguish "cats" from "dogs", we don't have to write a program that explicitly looks for "whiskers" or "pointy ears"; instead, we just show it examples of cats and dogs and let the network "learn" how to distinguish them. The trained network then "generalizes" from the examples it is shown.
At its core, ChatGPT is just adding one word at a time. What it's doing is trying to produce a “reasonable continuation,” given the text so far.
Say we’ve got the text “The best thing about AI is its ability to”. Now imagine scanning billions of pages of text on the web, digitized books, etc. and finding all instances of this text, then seeing what word comes next what percentage of the time. In reality, ChatGPT doesn’t look at the literal text; it looks for things that in a certain sense “match in meaning”. But the end result is that it produces a ranked list of words that might follow, together with probabilities.
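The literal version of this idea can be sketched in a few lines of Python. The corpus below is a hypothetical toy example (a real system draws on billions of pages), but it shows how "what word comes next, what percentage of the time" becomes a ranked list with probabilities:

```python
from collections import Counter

# Hypothetical toy corpus; a real system scans billions of pages.
corpus = (
    "the best thing about AI is its ability to learn . "
    "the best thing about AI is its ability to adapt . "
    "the best thing about AI is its ability to learn . "
    "the best thing about AI is its ability to generalize . "
).split()

prompt_word = "to"
# Count every word that follows the prompt word in the corpus.
followers = Counter(
    corpus[i + 1] for i in range(len(corpus) - 1) if corpus[i] == prompt_word
)
total = sum(followers.values())
ranked = [(word, count / total) for word, count in followers.most_common()]
print(ranked)  # e.g. [('learn', 0.5), ('adapt', 0.25), ('generalize', 0.25)]
```

ChatGPT replaces this literal lookup with a learned model, but the output has the same shape: words paired with probabilities.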
At each step, it gets a list of words with associated probabilities. The question is: where do the probabilities come from? Let’s consider generating English text one letter (rather than one word) at a time. How can we work out what the probability for each letter should be? Take a sample of English text, and calculate how often different letters occur in it. For example, the image below displays letter counts based on articles on “cats” and “dogs” on Wikipedia.
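Counting letter frequencies is a one-liner with `collections.Counter`. The sentence below is a tiny stand-in for a Wikipedia article; a real estimate needs far more text:

```python
from collections import Counter

# A tiny stand-in for a Wikipedia article on cats; real estimates need far more text.
sample = "cats are small carnivorous mammals often kept as pets"
letters = [ch for ch in sample.lower() if ch.isalpha()]
counts = Counter(letters)
total = len(letters)

# Print the five most frequent letters with their empirical probabilities.
for letter, count in counts.most_common(5):
    print(f"{letter}: {count / total:.3f}")
```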
If we take a large enough sample of English text we can expect to eventually get at least fairly consistent results:
Now, instead of a single letter, here’s a plot that shows the probabilities of pairs of letters, a "2-gram", in typical English text. The possible first letters are shown across the page, the second letters down the page. Human language is not just a random jumble of words: it has grammatical rules, syntactical rules, and an underlying structure. For example, a "q" is generally followed by a "u". In the "2-gram" plot below we see that the “q” column is blank (zero probability) except on the “u” row.
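Building such a 2-gram table from text is straightforward. This toy sentence (a hypothetical example) already shows the "q is followed by u" regularity:

```python
from collections import Counter

# Toy 2-gram (letter-pair) counts; a real table is built from a large corpus.
text = "the quick question was quietly answered".replace(" ", "")
bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))

# In English text, 'q' is almost always followed by 'u'.
q_followers = {pair[1] for pair in bigrams if pair[0] == 'q'}
print(q_followers)  # {'u'}
```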
Ultimately, we have to formulate everything in terms of numbers. One way to do this is to assign a unique number to each of the 40,000 or so common words in English. For example, “the” might be 914, and “cat” might be 3542 (these are the numbers actually used by GPT-2). Here is what ChatGPT produces as the raw embedding vector for three specific words: cat, dog, and chair.
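A word-to-number table is just a dictionary lookup. The ids below for "the" and "cat" are the ones quoted above; the rest are made up for illustration:

```python
# Toy word-to-id table. "the"=914 and "cat"=3542 follow the text above;
# the other ids are hypothetical, for illustration only.
vocab = {"the": 914, "cat": 3542, "sat": 7300, "on": 319, "mat": 8810}

def encode(text):
    """Map each word to its numeric id, as a tokenizer would."""
    return [vocab[word] for word in text.lower().split()]

print(encode("the cat sat on the mat"))  # [914, 3542, 7300, 319, 914, 8810]
```

Real tokenizers work on sub-word pieces rather than whole words, but the principle is the same: text in, numbers out.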
With a sufficiently large corpus of English text, we can get pretty good estimates not just for probabilities of single letters or pairs of letters (2-grams), but also for longer runs of letters. And if we generate “random words” with progressively longer n-gram probabilities, they get progressively “more realistic". If we were able to use sufficiently long n-grams we’d basically “get a ChatGPT”, in the sense that we’d get something that would generate essay-length sequences of words with the “correct overall essay probabilities”.
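Here is a minimal sketch of generating text from a word-level 2-gram model, assuming a tiny hypothetical corpus. Longer n-grams would use more context per step but follow the same pattern:

```python
import random
from collections import defaultdict

# Build a word-level 2-gram model from a tiny hypothetical corpus.
corpus = "the cat sat on the mat and the dog sat on the rug".split()
model = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    model[prev].append(nxt)

def generate(start, n_words, seed=0):
    """Generate text by repeatedly sampling a follower of the last word."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(n_words - 1):
        followers = model.get(words[-1])
        if not followers:  # dead end: no observed continuation
            break
        words.append(rng.choice(followers))
    return " ".join(words)

print(generate("the", 8))
```

Because followers are drawn in proportion to how often they occur, frequent continuations dominate, which is exactly why longer n-grams produce progressively "more realistic" output.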
With approximately 40,000 common words in the English language, the number of possible 2-grams is 1.6 billion and the number of possible 3-grams is about 60 trillion. By 20-word sequences, the number of possibilities is greater than the total number of particles in the universe!
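The arithmetic is easy to check directly:

```python
# Combinatorial explosion of n-grams over a ~40,000-word vocabulary.
vocab_size = 40_000

print(f"2-grams:  {vocab_size ** 2:.2e}")   # 1.6 billion
print(f"3-grams:  {vocab_size ** 3:.2e}")   # ~6.4e13, i.e. tens of trillions
print(f"20-grams: {vocab_size ** 20:.2e}")  # ~1e92, vs. ~1e80 particles in the universe
```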
That’s the problem with this approach: there just isn’t anywhere close to enough English text ever written to be able to estimate those probabilities directly. Hence, we need to use models.
These models have parameters (set of “knobs" you can turn) to fit the model to the data. In the case of ChatGPT, lots of such knobs are used, actually, 175 billion of them. By comparison, the human brain has about 100 billion neurons.
How were all those 175 billion weights in its neural net determined? Basically, they are the result of very large-scale training based on a huge corpus of text on the web, in books, etc. written by humans.
Now the question is: from this list, which word should it actually pick? Going with the “highest-ranked” word makes sense, but this is where a bit of "voodoo" begins to creep in. If we always pick the highest-ranked word, we’ll typically get a very “flat” essay that never seems to “show any creativity” (and sometimes even repeats itself word for word). But if sometimes (at random) we pick lower-ranked words, we get a “more interesting” essay. The fact that there’s randomness here means that if we use the same prompt multiple times, we’re likely to get different essays each time. There’s a particular “temperature” parameter that determines how often lower-ranked words will be used, and for essay generation, it turns out that a temperature of 0.8 seems best. There’s no “theory” behind this; it’s just what’s been found to work in the real world.
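A minimal sketch of temperature sampling, assuming a hypothetical ranked word list: the probabilities are converted to logits, scaled by the temperature, and re-normalized before sampling. Low temperature concentrates mass on the top word; higher values let lower-ranked words through more often:

```python
import math
import random

def sample_with_temperature(word_probs, temperature=0.8, seed=None):
    """Re-weight a ranked word list by temperature, then sample one word."""
    rng = random.Random(seed)
    words = list(word_probs)
    # Convert probabilities to logits and scale by temperature.
    logits = [math.log(p) / temperature for p in word_probs.values()]
    # Subtract the max logit for numerical stability, then exponentiate.
    max_logit = max(logits)
    weights = [math.exp(l - max_logit) for l in logits]
    return rng.choices(words, weights=weights, k=1)[0]

# Hypothetical ranked list of next words with probabilities.
probs = {"learn": 0.5, "adapt": 0.3, "surprise": 0.2}
print(sample_with_temperature(probs, temperature=0.8, seed=42))
```

As the temperature approaches zero, this collapses to always picking the highest-ranked word, which is exactly the "flat essay" regime described above.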
Embeddings & Linguistic Feature Space:
Neural nets are based on numbers, so if we want them to work on text we need a way to represent our text with numbers. ChatGPT assigns a number to every word in the dictionary, but there is a central idea in ChatGPT that goes beyond this: the idea of "embedding". Think of an embedding as a way to represent the "essence" of something by an array of numbers, such that nearby things are represented by nearby numbers.
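The "nearby things get nearby numbers" idea can be made concrete with cosine similarity. The 4-dimensional vectors below are made up for illustration (real embeddings have hundreds of dimensions), but they show how "cat" lands near "dog" and far from "chair":

```python
import math

# Hypothetical 4-dimensional embeddings; real ones have hundreds of dimensions.
embeddings = {
    "cat":   [0.9, 0.1, 0.3, 0.0],
    "dog":   [0.8, 0.2, 0.3, 0.1],
    "chair": [0.0, 0.9, 0.1, 0.8],
}

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine(embeddings["cat"], embeddings["dog"]))    # high: similar meaning
print(cosine(embeddings["cat"], embeddings["chair"]))  # low: different meaning
```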
Inside ChatGPT, any piece of text is effectively represented by an array of numbers that we can think of as coordinates of a point in some kind of “linguistic feature space”. When ChatGPT continues a piece of text, this corresponds to tracing out a trajectory in that linguistic feature space. Here is an example of how words corresponding to different parts of speech get laid out if we project such a feature space down to just 2D.
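A crude version of such a projection: take dot products of each embedding with two hand-picked direction vectors. Everything below (the embeddings and the axes) is hypothetical; real visualizations use techniques like PCA or t-SNE to choose the projection, but the effect is the same, with words of the same part of speech clustering together:

```python
# Hypothetical 4-dimensional embeddings for two nouns and two verbs.
embeddings = {
    "cat": [0.9, 0.1, 0.3, 0.0],
    "dog": [0.8, 0.2, 0.3, 0.1],
    "sit": [0.1, 0.8, 0.0, 0.7],
    "run": [0.2, 0.9, 0.1, 0.6],
}
axis_x = [1.0, 0.0, 0.5, 0.0]  # hypothetical "noun-ness" direction
axis_y = [0.0, 1.0, 0.0, 0.5]  # hypothetical "verb-ness" direction

def project(vec):
    """Project a vector to 2D via dot products with the two axes."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return (dot(vec, axis_x), dot(vec, axis_y))

for word, vec in embeddings.items():
    print(word, project(vec))  # nouns cluster together, verbs cluster together
```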
Now let's look at the trajectory that a prompt follows in feature space as ChatGPT continues it: "The best thing about AI is its ability to learn".
What we see in this case is that there’s a “fan” of high-probability words that seem to go in a more or less definite direction in the feature space (bold black lines). Below is a 3D representation of what's going on over 40 steps. This looks like a mess, and it doesn't do much to encourage the idea that one can expect to identify “mathematical-physics-like” “semantic laws of motion” by empirically studying “what ChatGPT is doing inside”.
As of now, we are not ready to “empirically decode” from its internal behavior what ChatGPT has “discovered” about how human language is “put together”.
Conclusion:
The basic concept of ChatGPT is at some level rather simple: start from a huge collection of human-created text, then train a neural net on it to generate text; in particular, make it able to start from a “prompt” and then continue with text that's “like what it has been trained with”.
The actual neural net in ChatGPT is made up of very simple elements, billions of them. The basic operation is also very simple: input derived from the text generated so far is passed once through its elements, without any loops. But the remarkable and unexpected thing is that this process can produce text that is "human-like".
At some level, it's a great example of the fundamental scientific fact that large numbers of simple computational elements can do remarkable and unexpected things.