Large Language Models: The Power of Billions of Parameters
Image of Galileo Galilei's Istoria e Dimostrazioni Intorno alle Macchie Solari (Roma, 1613), preserved at the Strahov Monastery in Prague


By Isabel Hong and Denis Parra Santander


In the current landscape of AI and Natural Language Processing, large language models (LLMs) have dominated the conversation. With LLMs, a fascinating convergence occurs: mathematical principles, computational concepts, linguistics and the philosophy of language intersect to lay the foundations upon which generative AI for natural language has become possible.

Decades of scientific research, including periods of stagnation known as "AI winters," have paved the way for the impact of generative AI today. OpenAI introduced ChatGPT in November 2022, powered by GPT-3.5, a version that evolved from the third-generation GPT language model (LM). GPT-3 was already enormous, with 175 billion parameters, placing it in the category of large LMs, or LLMs.

The progress of GPT LLMs has been remarkably fast. GPT-1, released in 2018, contained 117 million parameters. Only a year later, GPT-2 scaled up to 1.5 billion parameters, and it's rumored that GPT-4 possesses trillions of parameters.

As each new version is introduced, the term "parameters" comes up frequently. But what are these parameters? How does the number of parameters in an LLM relate to its capabilities? And, fundamentally, what is a language model?

Understanding GPT as a Function

Before answering these questions, it's important to recognize that at its core, GPT operates as a mathematical function. GPT takes a user's input in the form of a text prompt (let's call it x), processes it as a function would (let's call this f(x)) and outputs the next most appropriate word to answer a question or continue a conversation (let's call it y). The model then uses x + y as the new input to continue generating text, and so on iteratively. In other words, after generating a word, the new input is the initial user's prompt plus the output that GPT generated itself.

Image of text prompt > GPT model > probability distribution over words
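
To make this iterative loop concrete, here is a minimal Python sketch. The next_word_distribution function is a hypothetical stand-in for the trained model f(x): it is hard-coded here for illustration, whereas a real LLM would compute a probability distribution over its entire vocabulary from the prompt.

```python
# A minimal sketch of the iterative generation loop, not the actual GPT
# implementation. `next_word_distribution` stands in for the trained model f(x).

def next_word_distribution(prompt: str) -> dict:
    # Toy, hard-coded distribution; a real model derives it from the prompt.
    return {"cold": 0.8, "sad": 0.15, "happy": 0.04, "relaxed": 0.01}

def generate(prompt: str, num_words: int = 1) -> str:
    text = prompt                                            # x: the user's prompt
    for _ in range(num_words):
        distribution = next_word_distribution(text)          # f(x)
        next_word = max(distribution, key=distribution.get)  # y: most probable word
        text = text + " " + next_word                        # the new input is x + y
    return text

print(generate("In winter I usually feel really"))
# -> In winter I usually feel really cold
```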


What is a Language Model?

In essence, a language model is a probability distribution over words. A probability distribution is a mechanism, or function, that assigns values between 0 and 1 to a set of elements, with the constraint that all the probabilities assigned must add up to 1. For instance, if we asked for the probability distribution over the words that could follow the sentence "In winter I usually feel really", it might look something like this:

cold : 0.8

sad : 0.15

happy : 0.04

relaxed : 0.01

The fact that we provide a conditioning context (the initial sentence) when assigning probabilities to the following words indicates that this is a "conditional" probability distribution. This means that if we change the context, the probabilities might change. For an alternative sentence such as "In spring I usually feel really", the probabilities can change to:

cold : 0.01

sad : 0.04

happy : 0.8

relaxed : 0.15

In simpler terms, given a prompt, the language model calculates how probable each possible next word is, based on the patterns of human language.
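
One toy way to picture this conditional behaviour (with made-up probabilities, not real GPT outputs) is a small lookup table keyed by the context: change the context and the distribution over next words changes with it.

```python
# Made-up conditional distributions for illustration; a real language model
# computes such probabilities over its entire vocabulary, for any context.
next_word_probs = {
    "In winter I usually feel really": {"cold": 0.80, "sad": 0.15, "happy": 0.04, "relaxed": 0.01},
    "In spring I usually feel really": {"cold": 0.01, "sad": 0.04, "happy": 0.80, "relaxed": 0.15},
}

for context, dist in next_word_probs.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9   # probabilities must add up to 1
    most_likely = max(dist, key=dist.get)         # word with the highest probability
    print(f"{context} -> {most_likely}")
```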

As of the time of writing, GPT predicts a probability distribution across its entire vocabulary, covering an enormous range of the words and phrases found in human language. While there is no exact count of the words in any particular language, and that number is constantly evolving, English, for example, has around 170,000 words in active use. So, at each step, the model must choose one word among potentially 170,000 to generate a complete and accurate answer.

And here's the thing: GPT is doing this all the time, much like an iterative autocomplete tool. But GPT is more than just an autocomplete tool. It doesn't only predict the next word; through this iterative process it can generate entire paragraphs when prompted.

GPT can give the sense of capturing the user's intention. This is the result of the billions of parameters of the neural network that can represent the complex nuances of language. Now, let's dive deeper into these two terms: neural networks and parameters.

Screenshot of a conversation with ChatGPT


Understanding the Model's Parameters: Weights and Biases

An LLM can be implemented with different kinds of models; nowadays, the predominant ones are artificial neural networks, also known simply as neural networks. A neural network learns to recognize language patterns by being trained on a series of input-output pairs. The architecture of these neural networks was initially inspired by how biological neurons connect in the human brain.

To have a working neural network model that generates sentences, we first need to train it with pairs of inputs and corresponding outputs for a task, which means changing something in its internal representation so it can perform that task, much as we train our pets to sit on a signal. Once the model is trained, we can stop training ("freeze its parameters") and simply use it for inference, which means producing a prediction for a given input.

In the following illustration, we can observe a neural network where neurons are depicted as circles or nodes, and the connecting lines represent various weights (parameters), each being a numerical value.

Image of neural network


Green nodes: Input layer (receives data)

Red nodes: Output layer (produces results or predictions)

Blue nodes: Hidden layers (intermediary layers that process the input data). Having several hidden layers is what characterizes this as a "deep" neural network

Edges between nodes: weights or parameters of the neural network, which are updated during the training phase

During the training phase, weights and biases (the parameters of the network) are continuously adjusted to minimize prediction errors. These parameters act like a radio dial, tuning the model's performance.

In the case of GPT-3, the 175 billion parameters refer to the model's size, specifically to the quantity of weights and biases of the model.
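
As a rough sketch of what gets counted, here is a hypothetical tiny network in Python with NumPy: two inputs, three hidden neurons, one output. This is nothing like GPT's actual architecture; it only shows that the parameters are the entries of the weight matrices plus the bias vectors.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

W1 = rng.normal(size=(2, 3))   # weights from the input layer to the hidden layer
b1 = np.zeros(3)               # biases of the three hidden neurons
W2 = rng.normal(size=(3, 1))   # weights from the hidden layer to the output neuron
b2 = np.zeros(1)               # bias of the output neuron

def forward(x):
    hidden = np.tanh(x @ W1 + b1)   # hidden-layer activations
    return hidden @ W2 + b2         # the network's prediction

num_parameters = sum(p.size for p in (W1, b1, W2, b2))
print(num_parameters)   # 13 parameters in this toy network; GPT-3 has about 175 billion
```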

The Two Phases in Neural Networks: Training and Inference

LLMs involve two phases: training and inference. During the training phase, we feed the network pairs of input-output data, such as large amounts of text examples with their expected outputs. Imagine a language model that translates English to French: the model receives inputs, predicts outputs, and we also have the actual expected outputs (the ground truth).

Table of input, output and expected output phrases in English and French


In this phase, the model's parameters, or weights, are updated based on how well the output aligns with the expected output. The billions of parameters in GPT mean more data and computational power are needed – a trade-off for a better language model. In other words, more parameters generally lead to better-performing models, such as ChatGPT.

Following the training phase, we enter the inference phase, where the model's parameters are no longer adjusted: we "freeze" the model in its already-trained state. Here, a new phrase is input into the model, and the model's quality is evaluated based on whether it arrives at the correct answer.
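
Here is a deliberately over-simplified Python sketch of the two phases, assuming a "model" with just one weight and one bias that learns the rule y = 2x + 1. Training repeatedly nudges the parameters to reduce the error; inference uses the frozen parameters to predict on new inputs.

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0])   # training inputs
Y = 2 * X + 1                        # expected outputs (ground truth)

w, b = 0.0, 0.0                      # the model's two parameters, untrained
learning_rate = 0.05

# Training phase: adjust the parameters to minimize the prediction error.
for _ in range(2000):
    predictions = w * X + b
    error = predictions - Y
    w -= learning_rate * (2 * error * X).mean()   # update the weight
    b -= learning_rate * (2 * error).mean()       # update the bias

# Inference phase: the parameters are frozen; we only compute predictions.
def infer(x):
    return w * x + b

print(round(infer(10.0), 2))         # close to 21.0 if training went well
```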

The significance of having more parameters lies in the improved quality of the model, which produces better results. When we train a model on one task and then use it for other tasks, we refer to it as being "pre-trained." Since GPT was not originally trained to complete code, for instance, yet can do it quite well, among other language feats, it is called a Generative Pre-trained Transformer. The term "transformer" refers to the special type of neural network architecture used.

These billions of parameters in GPT represent the learned variables acquired during the model's training process. They enable the model to capture language patterns, resulting in context-based predictions in response to input.

Language in Context

GPT captures the meaning of words in different contexts, which is very difficult due to polysemy (the fact that a single word might have different meanings). The idea of capturing the meaning of a word through its contextual usage traces back to Ludwig Wittgenstein, a language philosopher who wrote in Philosophical Investigations that "The meaning of a word is its use in the language."

For instance, if we provide a vague input like "The sky," there are too many potential continuations, which leads to a state of uncertainty or, in other words, a high-entropy output. This ambiguity, having too many possibilities, makes it challenging to predict what follows.

So, each of the following three sentences has a roughly equal probability of coming next.

(The sky) "is clear and filled with stars."

(The sky) "is filled with clouds."

(The sky) "has shades of pink, orange and purple."

However, if we give it more context, we lower the entropy, and the model can predict a particular sentence with higher probability.

For example, if the input has a more precise context, such as "During sunset, the sky", there is a higher probability that what follows is "has shades of pink, orange and purple."
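
To put a number on this intuition, here is a small Python sketch using Shannon entropy. The probabilities are invented for the example; the point is only that a more specific context concentrates the distribution and therefore lowers the entropy.

```python
import math

def entropy(distribution):
    """Shannon entropy (in bits) of a dict mapping continuations to probabilities."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)

# "The sky" -> the three continuations are equally likely (made-up numbers).
vague_context = {
    "is clear and filled with stars.": 1 / 3,
    "is filled with clouds.": 1 / 3,
    "has shades of pink, orange and purple.": 1 / 3,
}

# "During sunset, the sky" -> one continuation dominates (made-up numbers).
precise_context = {
    "is clear and filled with stars.": 0.1,
    "is filled with clouds.": 0.1,
    "has shades of pink, orange and purple.": 0.8,
}

print(round(entropy(vague_context), 2))    # ~1.58 bits: high uncertainty
print(round(entropy(precise_context), 2))  # ~0.92 bits: lower uncertainty
```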

With billions and possibly trillions of parameters, neural LLMs such as GPT not only predict words but also capture context, leaving us in awe of the remarkable ability to do so through statistics and computational power.


