"How does ChatGPT work, under the hood?" - no math, no code.

"How does ChatGPT work, under the hood?" - no math, no code.

It is about time that I write about how the most visible chatbot in the world operates under the hood. I have noticed that a lot of people in the industry still struggle to explain the inner workings of the generative language model behind ChatGPT, so I decided to dedicate an article to it. Two years too late. If you already know the intricate mechanics of the transformer architecture, the encoder-decoder architecture, and so on, you can skip this entire article.


The core components of generative LLMs

The purpose of this overview is simple. The quality of generative language models has drastically improved in the last year (see above), and we want to understand what changes and new techniques catalyzed this boost in quality. Here, we will stick to transformer-based language models, though the concept of a language model predates the transformer architecture, dating back to recurrent neural network-based architectures (e.g., ULMFiT [4]) or even n-gram language models.

Top-level view. To explain generative LLMs in a clear and simple manner, we must first identify the key ideas and technologies that underlie them. In the case of this overview, we will focus upon the following three components:

  1. Transformer architecture: the neural network architecture used by LLMs.
  2. Language model pretraining: the (initial) training process used by all LLMs.
  3. The alignment process: how we teach LLMs to behave to our liking.

Together, these ideas describe the technology that powers generative LLMs like ChatGPT. Within this section, we will develop a working understanding of these key ideas and how they combine together to create a powerful LLM.

Transformer Architecture

Nearly all modern language models are based upon the transformer architecture (shown above) [5] — a deep neural network architecture that was originally proposed for solving Seq2Seq tasks (e.g., summarization or language translation). The transformer takes a sequence of text as input and is trained to perform some generative (e.g., summarization) or discriminative (e.g., classification) task. We will now overview the transformer architecture and how it is leveraged by LLMs to produce coherent and interesting sequences of text. For a more in-depth explanation of the transformer architecture in general, check out the link here.


Encoder-decoder architecture. The transformer has two primary components: the encoder and the decoder. The encoder looks at the full sequence of text provided as input and builds a representation of this text. Then, the decoder takes this representation as input and uses it to produce an output sequence. For example, if we train a transformer to translate a sentence from English to Chinese, the model will perform the following processing steps:

  • The encoder ingests the English sentence.
  • The encoder outputs a representation for the English sentence.
  • The decoder ingests the encoder’s representation of the English sentence.
  • The decoder generates the Chinese translation.
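
To make these steps concrete, here is a tiny sketch using the Hugging Face transformers library (my choice for illustration, assumed installed along with its dependencies; the checkpoint name below is just one publicly available English-to-Chinese model and has nothing to do with ChatGPT specifically):

```python
# Minimal sketch of an encoder-decoder transformer doing translation.
# The checkpoint name is one public English-to-Chinese model, chosen only
# to illustrate the encode-then-decode flow described above.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-zh")

# Internally: the encoder reads the English sentence and builds a
# representation, and the decoder generates Chinese tokens from it.
result = translator("The weather is nice today.")
print(result[0]["translation_text"])
```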

Decoder-only architecture. Although the originally-proposed transformer has both an encoder and a decoder module, generative LLMs primarily use a decoder-only transformer architecture; see above. This architecture eliminates the encoder from the transformer architecture, leaving only the decoder. Given that the decoder’s role in the transformer is to generate textual output, we can intuitively understand why generative LLMs only use the decoder component — the entire purpose of these models is to generate sequences of text!

Constructing the input. To better understand how decoder-only transformers operate, let’s take a deeper look at the input to this model and how it is created. Generative LLMs take a sequence of text as input. However, we can’t just pass raw text or characters into the transformer. First, we tokenize this text, or break it into a sequence of tokens (i.e., words or sub-words); see below.
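
As a small illustration, here is what tokenization looks like using OpenAI's tiktoken library (my choice for the example; any BPE tokenizer would show the same idea, and nothing here depends on ChatGPT's exact tokenizer):

```python
# A small tokenization example using the `tiktoken` library (assumed installed).
# "cl100k_base" is an encoding used by several OpenAI models.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("ChatGPT breaks text into tokens.")
print(tokens)                               # integer IDs, one per token
print([enc.decode([t]) for t in tokens])    # the word or sub-word piece behind each ID
```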

After tokenizing the input sequence, we convert each token into an associated, unique vector representation, forming a list of vectors that correspond to each token; see below. These token vectors are lists of numbers that quantitatively describe each token. After adding positional embeddings — or extra vectors that capture the position of each token within the input sequence — to these token vectors, we are left with the final input that is passed into the decoder-only transformer!

Token vectors are stored in an embedding layer — a big matrix, where each row stores the vector for a single token — that is part of the transformer architecture. At first, these token vectors are randomly initialized, but we learn them during the training process just like any other neural network parameter!
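
In code, the embedding layer and positional embeddings look roughly like this (a toy PyTorch sketch with made-up sizes, not the configuration of any real model):

```python
# Toy sketch: turning token IDs into the transformer's input vectors.
# All sizes are illustrative assumptions.
import torch
import torch.nn as nn

vocab_size, d_model, max_len = 50_000, 768, 1024
token_embedding = nn.Embedding(vocab_size, d_model)     # one learned row per token
position_embedding = nn.Embedding(max_len, d_model)     # one learned row per position

token_ids = torch.tensor([[15, 2023, 7, 991]])           # (batch, sequence length)
positions = torch.arange(token_ids.shape[1]).unsqueeze(0)

# token vector + positional vector = the final input to the decoder-only transformer
x = token_embedding(token_ids) + position_embedding(positions)
print(x.shape)  # torch.Size([1, 4, 768])
```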

Processing the input. Now that we understand how the model’s input is constructed, the next step is to understand how each of the model’s layers transforms this input. Each “block” of the decoder-only transformer takes a list of token vectors as input and produces a list of transformed token vectors (of the same size) as output. This transformation consists of two operations:

  1. Masked Self-Attention: transforms each token vector by considering tokens that precede it in the sequence; see above and here for more details.
  2. Feed-forward transformation: transforms each token vector individually via a sequence of linear layers and non-linearities (e.g., ReLU); see below.

Together, masked self-attention and feed-forward transformations allow us to craft a rich representation of any textual sequence. These components each play a distinct and crucial role:

  • The attention component grabs useful information from the broader sequence (i.e., tokens that precede the current token).
  • The feed-forward component learns a pattern or transformation that is individually applied to each token vector.

By stacking several decoder-only transformer blocks on top of each other, we arrive at the architecture that is used by (nearly) all generative LLMs!
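
For readers who want to see the shape of this in code, here is a deliberately minimal PyTorch sketch of a single decoder block; layer normalization, dropout, and other production details are left out, and all sizes are assumptions:

```python
# Minimal decoder-only transformer block: masked self-attention followed by a
# position-wise feed-forward network. Simplified for illustration only.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffwd = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        seq_len = x.shape[1]
        # Causal mask: each token may attend only to itself and earlier tokens.
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = x + attn_out            # residual connection around attention
        return x + self.ffwd(x)     # feed-forward applied to each token individually

x = torch.randn(1, 4, 768)          # (batch, tokens, vector size)
print(DecoderBlock()(x).shape)       # same shape out as in
```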

Language Model Pretraining

“Self-supervised learning obtains supervisory signals from the data itself, often leveraging the underlying structure in the data. The general technique of self-supervised learning is to predict any unobserved or hidden part (or property) of the input from any observed or unhidden part of the input.” — from [7]

Now that we understand the transformer, let’s dive into the next key idea that underlies generative LLMs — self-supervised pretraining. Self-supervised learning refers to the idea of using signals that are already present in raw data to train a machine learning model. In the case of generative language models, the most commonly-used objective for self-supervised learning is next token prediction, also known as the standard language modeling objective. Interestingly, this objective — despite being quite simple to understand — is the core of all generative language models. We’ll now explain this objective, learn how it works, and develop an understanding of why it is so useful for generative LLMs.

LLM pretraining. To pretrain a generative language model, we first curate a large corpus of raw text (e.g., from books, the web, scientific publications, and much more) to use as a dataset. Starting from a randomly initialized model, we then pretrain the LLM by iteratively performing the following steps:

  1. Sample a sequence of raw text from the dataset.
  2. Pass this textual sequence through the decoder-only transformer.
  3. Train the model to accurately predict the next token at each position within the sequence.

Here, the underlying training objective is self-supervised, as the label that we train the model to predict — the next token within the sequence — is always present within the underlying data. As a result, the model can learn from massive amounts of data without the need for human annotation. This ability to learn directly from a large textual corpus via next token prediction allows the model to develop an impressive understanding of language and a broad base of knowledge. For a more in-depth discussion of the pretraining process, check out the link here.
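
As a rough sketch (not any lab's actual training code), the objective boils down to a shifted cross-entropy loss: the model's prediction at each position is scored against the token that actually comes next.

```python
# Sketch of the next-token-prediction (language modeling) objective.
# `logits` stands in for the model's output; in real pretraining it would come
# from the decoder-only transformer, and an optimizer step would follow.
import torch
import torch.nn.functional as F

vocab_size = 50_000
token_ids = torch.randint(0, vocab_size, (1, 9))             # one raw-text sequence
logits = torch.randn(1, 9, vocab_size, requires_grad=True)   # pretend model output

loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),   # predictions at positions 0..n-2
    token_ids[:, 1:].reshape(-1),             # the "label" is simply the next token
)
loss.backward()
print(loss.item())
```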

Inference process. Generative LLMs use a next token prediction strategy for both pretraining and inference! To produce an output sequence, the model follows an autoregressive process (shown above) consisting of the following steps, with a rough code sketch after the list:

  1. Take an initial textual sequence (i.e., a prompt) as input.
  2. Predict the next token.
  3. Add this token to the input sequence.
  4. Repeat steps 2–3 until a terminal/stop token (i.e., <EOS>) is predicted.
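
In code, the loop looks roughly like the sketch below. Here `model`, `prompt_tokens`, and `eos_id` are placeholders rather than real objects, the model is assumed to return one row of logits per input position, and greedy decoding is used for simplicity where real systems typically sample.

```python
# Schematic autoregressive generation loop (greedy decoding for simplicity).
# `model` is a placeholder that maps token IDs to per-position logits.
import torch

def generate(model, prompt_tokens, eos_id, max_new_tokens=50):
    tokens = list(prompt_tokens)                   # 1. start from the prompt
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([tokens]))     # (1, len(tokens), vocab_size)
        next_token = int(logits[0, -1].argmax())   # 2. predict the next token
        tokens.append(next_token)                  # 3. append it to the input
        if next_token == eos_id:                   # 4. stop at the terminal token
            break
    return tokens
```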

Keys to success. A decoder-only transformer that has been pretrained (using next token prediction) over a large textual corpus is typically referred to as a base (or pretrained) model. Notable examples of such models include GPT-3 [3], Chinchilla [9], and LLaMA-2 [10]. As we progressed from early base models (e.g., GPT [1]) to the models that we have today, two primary lessons were learned.

  1. Larger models are better: increasing the size (i.e., number of trainable parameters) of the underlying model yields a smooth increase in performance, which led to the popularization of large language models.
  2. More data: despite yielding a smooth increase in performance, increasing the size of the underlying model alone is suboptimal. We must also increase the size of the underlying pretraining dataset to get the best results.

Put simply, the best base models combine large model architectures with massive amounts of high-quality pretraining data.

Computational cost. Before moving on, we should note that the pretraining process for generative LLMs is incredibly expensive: we are training a large model over a massive dataset. Pretraining costs range from several hundred thousand dollars (see above) to millions of dollars, if not more. Despite this cost, pretraining is an incredibly important step; models like ChatGPT would not be possible without the creation of, or access to, a high-quality base model. To avoid the cost of language model pretraining, most practitioners simply download open-source base models that have been made available online via platforms like Hugging Face.

The Alignment Process

“Making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users.” — from [6]

After pretraining, the LLM can accurately perform next token prediction, but its output is oftentimes repetitive and uninteresting. For an example of this, we can look back to the beginning of this overview! Namely, GPT-2 (i.e., a pretrained LLM) struggles to produce helpful and interesting content. With this in mind, we might ask: What do these models lack? How did we get from this to ChatGPT?

The alignment process, which teaches a language model how to generate text that aligns with the desires of a human user, is the answer to the question above. For example, we can teach the model to:

  • Follow detailed instructions
  • Obey constraints in the prompt
  • Avoid harmful outputs
  • Avoid hallucinations (i.e., generating false information)
  • Pretty much anything we can demonstrate to the LLM!

The objectives of the alignment process are typically referred to as alignment criteria, and we (i.e., the researchers training the model) must define these criteria at the outset of the alignment process. For example, two of the most commonly-used alignment criteria within AI research (e.g., see LLaMA-2 [10] and Constitutional AI [11]) are helpfulness and harmlessness. To instill each of these alignment criteria within the model, we perform finetuning via supervised finetuning (SFT) and reinforcement learning from human feedback (RLHF).

SFT is simple to understand, as it is based upon the same objective as pretraining: next token prediction. However, instead of training the model over a bunch of raw text downloaded from the web, we use a more specific training dataset. Namely, we train the model on alignment-focused data: demonstrations, written by humans or generated synthetically by another LLM, of prompt and response pairs that satisfy the set of desired alignment criteria; see above. For a more in-depth discussion of SFT, check out the link here.
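
As a sketch of how this differs from pretraining in practice (this reflects one common setup, assumed here, rather than OpenAI's exact recipe), the loss is still next token prediction, but it is computed only over the response portion of each demonstration:

```python
# Sketch of a supervised finetuning (SFT) loss on one prompt/response pair.
# `model` is a placeholder returning per-position logits; masking prompt
# tokens out of the loss is one common convention, assumed here.
import torch
import torch.nn.functional as F

def sft_loss(model, prompt_ids, response_ids, ignore_index=-100):
    input_ids = torch.cat([prompt_ids, response_ids]).unsqueeze(0)   # (1, seq_len)
    labels = input_ids.clone()
    labels[0, : prompt_ids.shape[0]] = ignore_index    # do not penalize prompt tokens
    logits = model(input_ids)                           # (1, seq_len, vocab_size)
    # Shift by one: the prediction at position i is scored against token i + 1.
    return F.cross_entropy(logits[0, :-1], labels[0, 1:], ignore_index=ignore_index)
```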


Although SFT is effective and widely-used, manually curating a dataset of high-quality model responses can be difficult. RLHF solves this issue by directly finetuning an LLM based on human feedback; see below for a depiction.

This process proceeds in two phases. During the first phase, we start with a set of prompts and use the LLM — or a group of several LLMs — to generate two or more responses to each prompt. Then, a group of human annotators ranks responses to each prompt based on the defined alignment criteria, and we use this data to train a reward model that accurately predicts a human preference score given a prompt and a model’s response as input. Once the reward model is available, we use a reinforcement learning algorithm (e.g., PPO) during the second phase of RLHF to finetune the LLM to maximize human preference scores, as predicted by the reward model. For a more detailed discussion of RLHF, check out the link here.
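
To make the first phase slightly more concrete, the reward model is commonly trained with a pairwise (Bradley-Terry style) loss over the human rankings. The sketch below assumes a `reward_model` placeholder that returns a scalar score (as a tensor) for a prompt and response; it is schematic, not ChatGPT's actual code.

```python
# Sketch of the reward-model training loss used in the first RLHF phase.
# `reward_model` is a placeholder: an LLM with a scalar "score" head.
import torch.nn.functional as F

def preference_loss(reward_model, prompt, chosen, rejected):
    r_chosen = reward_model(prompt, chosen)       # score for the preferred response
    r_rejected = reward_model(prompt, rejected)   # score for the less preferred one
    # Push the preferred response's score above the other response's score.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```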

Finetuning pipeline. Although SFT and RLHF are standalone techniques, most state-of-the-art LLMs use the three-step alignment process, including both SFT and RLHF. This approach, depicted below, became a standard within LLM research after the proposal of InstructGPT [6] — the predecessor to ChatGPT.

Putting Everything Together

As long as we understand the concepts that have been discussed so far, we now have a working understanding of ChatGPT. In particular, modern generative LLMs i) use a transformer architecture, ii) are pretrained over a large corpus of text downloaded from the web, and iii) are aligned via SFT and RLHF; see above.

The progression of LLM research. To understand how these three language modeling components have influenced the development of LLMs and AI, we can study the progression of prominent papers within the language modeling domain; see below. Before the advent of the transformer architecture, language models were still around, but they were based upon simpler architectures (e.g., recurrent neural networks). Although the transformer was originally proposed to solve Seq2Seq tasks, we quickly saw the decoder-only variant of the transformer architecture applied to the language modeling domain with GPT and GPT-2 [1, 2].

These early, transformer-based language models performed well, but we didn’t see truly impressive performance until much larger language models (i.e., true LLMs), such as GPT-3 [3], were explored. These models were found to have impressive few-shot learning capabilities across a variety of tasks as they were scaled up in size. However, model size alone is not enough! We learn from Chinchilla [9] that the best base models are created by combining large models with large pretraining datasets; both parameter and data scale are important.

Pretrained LLMs can accurately solve many tasks, which led to recognition within the AI community. However, these models were oftentimes repetitive or uninteresting, struggled to follow instructions, and were generally lacking in helpfulness. To solve this problem, the alignment process directly finetunes the LLM based on a set of criteria that describe desired model behavior. Such an approach drastically improves the quality of the underlying model, leading to the creation of powerful models like ChatGPT and a subsequent explosion of interest in generative LLMs; see below. In other words, alignment is (arguably) the key advancement that made the powerful LLMs we see today possible.

What’s next? At this point, we understand everything that goes into creating an LLM like ChatGPT. All that is left is to learn how to apply such models to solving practical problems. To solve downstream tasks with an LLM, there are two basic approaches we can take (shown below):

  1. In-context learning: write a prompt to solve the desired task.
  2. Finetuning: perform further training on data for the desired task.
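
As a quick illustration of the first approach (the task and prompt below are my own made-up example, not from any benchmark), in-context learning means the task is specified entirely in the prompt and no model weights are updated:

```python
# Illustrative few-shot prompt for in-context learning: the task description
# and examples live entirely in the text, and no finetuning is performed.
prompt = """Classify the sentiment of each review as Positive or Negative.

Review: "The battery lasts all day." -> Positive
Review: "The screen cracked within a week." -> Negative
Review: "Setup took five minutes and it just works." ->"""

# Sending `prompt` to an aligned LLM and letting it predict the next token(s)
# solves the task; finetuning would instead update the model's weights on
# labeled examples of the same task.
print(prompt)
```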

Key Takeaways

Within this overview, we distilled the key technologies underlying generative LLMs into a three-part framework that includes i) the transformer architecture, ii) language model pretraining, and iii) the alignment process. The key points to remember with respect to each of these components are as follows:

  • Generative LLMs are based upon the decoder-only transformer architecture.
  • First, we pretrain these models over massive amounts of textual data using the next token prediction objective.
  • After pretraining, we use SFT and RLHF (i.e., the three-step technique) to better align these models with the desires of human users.

By developing a deep understanding of these components and learning how to explain them to others (in simple terms), we can democratize understanding of this important technology. Today, AI-powered systems are impacting more people than ever before. AI is no longer just a research topic, but rather a topic of popular culture. As AI continues to evolve, it is paramount that builders, engineers, and researchers (i.e., people like us!) learn to effectively communicate about the technology, its pitfalls, and what is to come.

Connect with me!

Thanks so much for reading this article. I am Marco van Hurne, founder of Beyond the Cloud, a platform for inspiration and for disseminating knowledge, and VP of AI at EXN.


Think a friend would enjoy this too? Share the newsletter and let them join the conversation.


If you like my article, subscribe to my newsletter or connect with me. LinkedIn appreciates your likes by making my articles available to more readers.

Signing off - Marco

