Unlocking the black box: learning the mechanics of ChatGPT
Motivation
As GPTs transition from academic curiosity to the mainstream, data science leads must deepen their understanding of the underlying approach to brainstorm more effectively with business partners, refine use cases, and troubleshoot implementations.
This note summarizes how I typically explain the ideas behind AI chatbots, e.g., ChatGPT, to fellow data scientists. I assume the audience is at a similar stage in their learning journey as I was not too long ago — knowledgeable about deep learning but just beginning to explore the intricacies of Transformer-based generative AI. Emphasizing the importance of hands-on learning, this concise curriculum includes key papers, links to libraries, and code snippets to serve as inspiration.
Key ideas
Four primary concepts are sequentially applied to create a ChatGPT-like chatbot. Here’s a summary followed by resources for deeper study:
1. Attention: During the 2010s, embeddings were developed to map categorical data, such as tokens (words or subword units), into a lower-dimensional continuous vector space while preserving semantic relationships. Attention mechanisms build on this by updating each token's embedding based on the embeddings of the surrounding tokens, thereby capturing context more effectively (see the first sketch after this list).
2. Decoder, e.g., Generative Pre-trained Transformer (GPT): The network predicts the probability of the next token given a prompt. The architecture is surprisingly simple: the core is a stack of “self-attention heads”, each independently implementing the attention mechanism to capture dependencies within the input. To speed up training, all tokens are passed through the network simultaneously, unlike LSTMs, which processed them in a slower, sequential manner. To preserve sequence information, a positional embedding is added to the text embedding. Finally, feed-forward layers produce the probability of each possible next token (a minimal decoder sketch follows the list).
3. Token Selection Algorithm: While GPT predicts the probability of each subsequent token, generating coherent, long-form text requires strategic token selection. Greedy search, i.e., always selecting the most likely next token, often results in repetitive, non-human-like text. Instead, methods like beam search, top-k sampling, and top-p (nucleus) sampling are used to balance quality and diversity in the output (a sampling sketch follows the list).
4. Fine-Tuning with Reinforcement Learning from Human Feedback (RLHF): ChatGPT performs impressively on tasks beyond simple text completion, such as summarization and translation, thanks to extensive fine-tuning. This process involves three stages (the reward-model objective is sketched after this list):
- Supervised Learning: Fine-tuning on prompts paired with human-written desired outputs.
- Human-Ranked Responses: Collecting rank-order feedback from annotators on alternative responses to the same prompt.
- Reward Model Training: Developing a reward model that predicts a scalar value reflecting human preferences, which then guides reinforcement learning to align the chatbot's outputs with those preferences.
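To make the attention idea concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The function name, toy dimensions, and random weights are illustrative assumptions, not any library's API:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token embeddings.

    X: (seq_len, d_model) input embeddings; Wq, Wk, Wv: learned projections.
    Returns context-aware embeddings of shape (seq_len, d_head).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # pairwise token similarities
    weights = softmax(scores, axis=-1)       # how much each token attends to the others
    return weights @ V                       # each embedding updated by its context

# Toy example: 4 tokens, 8-dimensional embeddings, one 8-dimensional head
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 8)
```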
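Building on that, the next sketch assembles a toy one-block, one-head decoder: token plus positional embeddings, causally masked self-attention (so no token can peek ahead), a small feed-forward layer, and a projection to next-token logits. It deliberately omits layer normalization, multi-head attention, and other details of production GPTs; every name here is a hypothetical stand-in:

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    """Self-attention where each token attends only to itself and earlier tokens."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores[np.triu(np.ones(scores.shape, dtype=bool), k=1)] = -np.inf  # mask the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def gpt_forward(token_ids, tok_emb, pos_emb, Wq, Wk, Wv, W1, W2, W_out):
    """Toy decoder pass: embeddings -> masked attention -> feed-forward -> logits."""
    X = tok_emb[token_ids] + pos_emb[: len(token_ids)]  # add positional information
    X = X + causal_self_attention(X, Wq, Wk, Wv)        # residual connection
    X = X + np.maximum(X @ W1, 0.0) @ W2                # ReLU feed-forward block
    return X @ W_out  # row t holds logits for the token following position t

# Toy setup: vocabulary of 100 tokens, d_model = 8, context length 16
rng = np.random.default_rng(0)
tok_emb, pos_emb = rng.normal(size=(100, 8)), rng.normal(size=(16, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
W1, W2, W_out = rng.normal(size=(8, 32)), rng.normal(size=(32, 8)), rng.normal(size=(8, 100))
logits = gpt_forward(np.array([5, 17, 42]), tok_emb, pos_emb, Wq, Wk, Wv, W1, W2, W_out)
print(logits.shape)  # (3, 100); a softmax over the last row gives P(next token | prompt)
```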
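The gap between greedy decoding and sampling is easy to see in code. Below is a self-contained sketch of top-k filtering followed by top-p (nucleus) filtering over a vector of next-token logits; the function name and default thresholds are illustrative assumptions:

```python
import numpy as np

def sample_next_token(logits, top_k=50, top_p=0.9, rng=np.random.default_rng()):
    """Sample a next-token id after top-k, then top-p (nucleus), filtering."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-k: keep only the k most likely tokens
    order = np.argsort(probs)[::-1][:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]

    kept = probs[keep] / probs[keep].sum()  # renormalize over surviving tokens
    return rng.choice(keep, p=kept)

# Greedy decoding, by contrast, is just np.argmax(logits): deterministic and repetitive
fake_logits = np.random.default_rng(1).normal(size=100)
print(sample_next_token(fake_logits))
```

In practice, libraries such as Hugging Face Transformers expose these strategies directly, e.g., via the top_k and top_p arguments of model.generate.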
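Finally, the heart of reward-model training is a pairwise ranking loss: the response annotators preferred should receive a higher scalar score than the one they rejected. The PyTorch sketch below uses precomputed response embeddings as a stand-in for a full pretrained Transformer backbone; RewardModel and ranking_loss are hypothetical names, and the objective mirrors the -log sigmoid(r_chosen - r_rejected) form popularized by InstructGPT:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward model: maps a pooled response representation to a scalar score."""
    def __init__(self, d_model=768):
        super().__init__()
        # In practice this head sits on top of a pretrained Transformer;
        # here a single linear layer over a fixed embedding stands in for it.
        self.score = nn.Linear(d_model, 1)

    def forward(self, response_emb):
        return self.score(response_emb).squeeze(-1)  # one scalar per response

def ranking_loss(reward_model, chosen_emb, rejected_emb):
    """Pairwise loss from human rankings: preferred responses should score higher."""
    margin = reward_model(chosen_emb) - reward_model(rejected_emb)
    return -F.logsigmoid(margin).mean()

# Toy batch: 4 (chosen, rejected) pairs of 768-dimensional response embeddings
rm = RewardModel()
chosen, rejected = torch.randn(4, 768), torch.randn(4, 768)
loss = ranking_loss(rm, chosen, rejected)
loss.backward()  # gradients from this loss train the reward model
print(loss.item())
```

The trained reward model then supplies the scalar signal that a reinforcement-learning step (typically PPO) maximizes when fine-tuning the chatbot.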
Mastering these concepts is an essential first step. Realistically, most business applications today rely on straightforward calls to proprietary APIs, some prompt engineering, light fine-tuning, and sometimes custom Retrieval-Augmented Generation (RAG) or Agents. Yet, understanding the inner workings of these technologies is critical for comprehending their limitations and determining when a project requires deeper experimentation with the underlying Transformer architecture. I recommend a deeper dive into some of these topics using the links below!
Recommended topic-specific resources:
Attention
Transformers, including Decoders
Token Selection
Reinforcement learning from human feedback (RLHF)