Top 23 Concepts in LLMs
AI Art created with RunwayML + CC Express for Font


As Large Language Models (LLMs) and Generative AI continue to rule the zeitgeist, technical jargon can create an unnecessary barrier that prevents non-technical folks from diving deeper into the space, participating more actively, and making the most out of the technology. Here is my attempt at creating an easily understandable guide that demystifies the top 23 concepts in the LLM field so anyone can be more fluent on the topic.

Navigate this guide at your own leisure, skip known concepts, and if you notice I missed one or have suggestions, please write your feedback in the comments. Happy learning!

Table of Contents:

1. Foundation models

2. Multimodal models

3. Proprietary vs Open Source Models

4. Neural Networks

5. Parameters

6. Pre-training

7. Training Corpus

8. Back Propagation & Gradient Descent

9. Fine-tuning

10. Benchmarks

11. Non-determinism & probabilistic outputs

12. Hallucinations

13. Prompt engineering

14. Tokens, Strings & Tokenization

15. Context Window

16. Zero-shot & Few-shot prompting

17. Chain-of-Thought prompting

18. Reflexion & Iterative self-improvement

19. Alignment

20. Embeddings

21. Vector Database

22. Cosine Similarity

23. Agents

Foundation models

Foundation models, often referred to as 'base models,' are large machine learning models trained on massive datasets for broad applications that can be customized for more specialized tasks. Text generation models (LLMs) like OpenAI's GPT series (including GPT-3, GPT-3.5 Turbo, and GPT-4), Google’s PaLM, and Anthropic’s Claude exemplify this category. For image generation, models such as Stable Diffusion, DALL-E 2, and Midjourney stand as notable instances.

Multimodal models

Multimodal generative models can take multiple input formats (text, audio, images, or video) and can generate outputs in a single format or a mix of formats. For example, GPT-4 is a multimodal model that accepts images and text and generates text outputs from that combination. OpenAI demoed how GPT-4 can use an image of requirements written in a notepad to output working code for an app. At the time of writing, ChatGPT's multimodal capabilities are not yet available to the public.

Proprietary vs Open Source Models

Proprietary LLMs from companies like OpenAI and Google are private, with their architecture, training methods, and data undisclosed to the public. These models are available as paid services or within apps. Conversely, Open Source models are free, publicly available, and can be tailored for use on private servers, devices, or public clouds. Their architectures, training process, and datasets are transparent. However, some licenses limit their usage to non-commercial tasks. Companies such as Meta, Stability AI, and Databricks provide Open Source LLMs with commercial licenses. While proprietary models are generally larger and outperform Open Source models on many benchmarks, they can be costly. Meanwhile, Open Source models are evolving to become more performant and affordable, positioning them as viable alternatives for comparable tasks.

Image source: Madrona

Neural Networks

Neural Networks are machine learning models inspired by the human brain's structure, with the objective of making predictions. They consist of layers of 'neurons' or nodes interconnected with each other. In the image below, the input layer accepts initial data. Each subsequent layer processes the output from its predecessor. Neurons within a layer act on the data they receive and pass on their results to the next layer. This cycle continues until the final output emerges from the last layer, known as the output layer. As an illustration, when trained to classify images, a neural network might receive a cat's image. This image is dissected pixel by pixel and processed through the network, with each layer interpreting and modifying the pixel data. Ultimately, the output layer offers a prediction, suggesting the network's confidence that the image represents a cat.

Image: diagram composited from a few online sources to make a clear example

LLMs are neural networks with a specific type of architecture called the “Transformer”. GPT stands for Generative Pre-trained Transformer, and its objective is to accurately predict the next word (more precisely, the next token) of a sentence while generating coherent & accurate content.

Parameters

In LLMs, parameters are the internal variables (weights and biases) used to make a prediction (aka inference) based on a given input. When a model is being “trained”, its parameters are being “tuned” with training data so that it generates the desired output for a given task. When we talk about the size of an LLM, we refer to the number of parameters in the model. The larger the model, the more general and accurate its outputs tend to be. However, larger models also take longer for each inference and are more expensive to train and use.
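To make "weights and biases" a little more concrete, here is a minimal sketch of a toy two-layer network in plain Python. The layer sizes, the random starting values, and the helper names (`make_layer`, `forward`, `count_parameters`) are all assumptions made up for illustration; real LLMs use the same basic ingredients but with billions of parameters.

```python
import random

def make_layer(n_in, n_out, rng):
    """Create one dense layer: a weight matrix (n_out x n_in) and a bias vector."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0 for _ in range(n_out)]
    return weights, biases

def forward(layer, inputs):
    """Each neuron computes a weighted sum of its inputs plus a bias,
    then applies ReLU (negative results become 0)."""
    weights, biases = layer
    return [max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def count_parameters(layers):
    """Model size = total number of weights plus total number of biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)

rng = random.Random(0)
layers = [make_layer(3, 4, rng), make_layer(4, 2, rng)]  # 3 inputs -> 4 -> 2 outputs

activations = [0.5, -1.0, 2.0]  # made-up input data
for layer in layers:
    activations = forward(layer, activations)  # each layer feeds the next

print(count_parameters(layers))  # (3*4 + 4) + (4*2 + 2) = 26
print(activations)               # two output values
```

Training would mean nudging those 26 numbers until the outputs match the desired answers.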

Image source: Roundhill Investments

Pre-training

Pre-training is the process where the foundation model is initially trained by its creator on a vast and diverse corpus of data to perform desired tasks. For LLMs, pre-training helps the model learn language syntax, facts, and underlying patterns to accurately predict the next word in a sentence.

Training Corpus

This refers to the collection of datasets used to pre-train the models. For example, most LLMs use a combination of licensed content, proprietary content & open-source datasets such as The Pile, which has 800GB of diverse English text, including books, web pages, code, math, and content from technical trades. LAION-5B is a popular training corpus containing roughly 5.9 billion image-text pairs, used to train image generation models.

Back Propagation & Gradient Descent

These are the two primary algorithms used to train neural nets & LLMs, and they each play a different role.

Backpropagation is the method used to calculate the “gradient” of the loss function with respect to the weights & biases in the neural network. The gradient tells us the direction we need to move in order to reduce the error. Imagine you're playing a game of "Hot and Cold" where someone hides an object and you're trying to find it. They can't tell you where it is, but they can tell you if you're getting "warmer" (closer) or "colder" (further away) based on your current position. Backprop helps us figure out which weights and biases are responsible for “cold” or “warm” guesses so the network knows which to adjust to make better predictions in the future.

Gradient descent, on the other hand, is an optimization algorithm used to minimize the loss function. Once we've calculated the gradient of the loss function with respect to the weights and biases (using backpropagation), we can use gradient descent to adjust those weights and biases in the direction that reduces the loss function the most. In the “Hot and Cold” example, gradient descent is the strategy of repeatedly taking a small step in whichever direction gets you “warmer” until you reach the hidden object.

To summarize, when training a neural network, we use backpropagation to figure out how much each weight and bias contributed to the error, and then we use gradient descent to update them in small steps that improve the network's accuracy.
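For readers who want to see the two steps side by side, here is a minimal sketch on the smallest possible "network": a single weight `w` and a single bias `b` fitting a straight line. The data, the learning rate, and the function name `loss_and_gradients` are assumptions chosen for illustration. With only one layer the gradient can be written out by hand; backpropagation is what automates this calculation for networks with many layers.

```python
# Toy training data generated from the rule y = 2x + 1 (made up for this sketch).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

def loss_and_gradients(w, b):
    """Step 1 (the job backpropagation does in a deep network):
    measure the error and how much w and b each contributed to it."""
    n = len(data)
    loss = grad_w = grad_b = 0.0
    for x, y in data:
        error = (w * x + b) - y        # prediction minus target
        loss += error ** 2 / n         # mean squared error
        grad_w += 2 * error * x / n    # d(loss)/dw
        grad_b += 2 * error / n        # d(loss)/db
    return loss, grad_w, grad_b

w, b = 0.0, 0.0          # start from a "cold" guess
learning_rate = 0.05     # size of each step (assumed value)

for step in range(2000):
    loss, grad_w, grad_b = loss_and_gradients(w, b)
    # Step 2 (gradient descent): move each parameter a small step
    # in the direction that reduces the loss.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 3), round(b, 3))  # approaches the true values 2 and 1
```

The learning rate controls how big each step is: too large and the guesses overshoot, too small and training takes much longer.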

Fine-tuning

Fine-tuning is the process where a foundation model’s weights and biases are tuned to a more specific task or domain by providing it with domain-specific training data. This helps the models achieve better performance in narrower domains. For example, fine-tuning an LLM with medical data can result in a more accurate diagnosing model.

Benchmarks

A benchmark is a standardized set of tests used to objectively measure the performance of an LLM. It serves as a reference point against which various models can be compared. For example, “Massive Multitask Language Understanding” (MMLU) is a widely used benchmark to measure the knowledge acquired by an LLM during pre-training. It consists of 57 tasks ranging from mathematics, law, and science, to logic, moral reasoning, and computer science. Benchmarks vary in specificity. For example, HumanEval is a widely used benchmark to evaluate the programming ability of LLMs, and the Spider benchmark is even more specific, as it evaluates performance on Text-to-SQL translation. There are open-source leaderboards that monitor the performance of LLMs across a variety of benchmarks.

Image source: Llama paper by Meta

Non-determinism & probabilistic outputs

Because LLMs make predictions when they generate text, their output is considered to be non-deterministic. That is, they do not consistently and reliably generate the same answer each time you ask them the same question. Their output is instead probabilistic: at each step the model assigns a probability to every candidate next token and then picks one according to those probabilities.
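A minimal sketch of why the same question can produce different answers, using only Python's standard library. The candidate words and their scores are made up for illustration (a real model scores tens of thousands of tokens), and `temperature` is a common sampling setting assumed here rather than something specific to any one product.

```python
import math
import random

def softmax(scores, temperature=1.0):
    """Turn raw scores into probabilities that sum to 1.
    Lower temperature sharpens the distribution; higher flattens it."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidates for the next word after "The cat sat on the ..."
candidates = ["mat", "sofa", "roof", "moon"]
scores = [3.0, 2.0, 1.0, -1.0]                 # made-up model scores

probs = softmax(scores)
rng = random.Random()                          # unseeded: results vary between runs

# Sampling: pick a word at random, weighted by its probability.
samples = [rng.choices(candidates, weights=probs, k=1)[0] for _ in range(5)]

# Greedy decoding: always pick the most likely word (deterministic).
greedy = candidates[probs.index(max(probs))]

print([round(p, 3) for p in probs])
print(samples)   # may differ from run to run
print(greedy)    # always "mat"
```

Sampling is what makes outputs varied and creative; always taking the top choice makes them repeatable but often more monotonous.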

Hallucinations

Due to their probabilistic & generative nature, LLMs can “imagine” outputs that are not based on the information contained in their training data but rather on patterns they have seen, so the result can sound plausible yet be false. These fictitious outputs are referred to as Hallucinations.

Prompt engineering

Prompt engineering is the process of creating a detailed set of instructions for the LLM (aka the prompt) to achieve a desired output. The prompt is written in natural language, but it can contain code, or other media types (images, videos, audio, etc). In professional settings, prompts are generated programmatically, and augmented with different techniques to influence the LLM. Great prompt engineering can significantly impact the accuracy of the model’s output.

Tokens, Strings & Tokenization

In LLMs, tokens are the primary units for processing a text input and generating an output. A token can be a whole word, part of a word, or even a single character, depending on the context and language. A string is a sequence of characters, which can form words or sentences. Tokenization is the process of breaking a string into its constituent tokens, enabling the model to analyze the input and make predictions.
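To show how a string might be split into word pieces, here is a toy tokenizer. It is a simplified sketch, not how any particular LLM tokenizes text: production tokenizers learn their vocabulary from data (for example with byte-pair encoding), whereas the tiny `VOCAB` below and the greedy longest-match rule are assumptions made for illustration.

```python
# A tiny, hand-picked vocabulary of known pieces (hypothetical).
VOCAB = {"token", "ization", "un", "believ", "able", "is", " "}

def tokenize(text, vocab=VOCAB):
    """Split text into tokens by always taking the longest known piece;
    fall back to a single character when nothing in the vocabulary matches."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest slice first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:                               # no known piece starts here
            tokens.append(text[i])
            i += 1
    return tokens

text = "tokenization is unbelievable"
tokens = tokenize(text)
print(tokens)
# ['token', 'ization', ' ', 'is', ' ', 'un', 'believ', 'able']
print(len(text), "characters ->", len(tokens), "tokens")
```

Notice that a single word can become several tokens, which is why token counts are usually higher than word counts.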

Context Window

The context window is the maximum number of tokens that an LLM can handle in a single request to generate an appropriate output. Tokens that exceed the context window limit cannot be used by the LLM: depending on the tool, the extra text is cut off or the request is rejected. Different models have different-sized context windows. For example, GPT-3’s context window is 2048 tokens, GPT-4 can accept up to 32k tokens, and Anthropic’s Claude can accept up to 100k. Larger context windows take longer to process and are more expensive but can allow for more complex tasks, such as summarizing entire documents.
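A minimal sketch of how an application might keep a long prompt within the limit. It assumes that the prompt and the generated answer share the same window (true for many models, but an assumption here), and the function name `fit_to_context`, the 256-token output reserve, and the "keep the most recent tokens" policy are illustrative choices, not a standard.

```python
def fit_to_context(tokens, context_window, reserved_for_output):
    """Keep only as many prompt tokens as fit once room is set aside
    for the model's answer. Drops the oldest tokens first."""
    budget = context_window - reserved_for_output
    if budget <= 0:
        raise ValueError("No room left for the prompt")
    return tokens[-budget:]   # the most recent `budget` tokens

# A made-up prompt of 3000 tokens sent to a model with a 2048-token window.
prompt_tokens = [f"tok{i}" for i in range(3000)]
kept = fit_to_context(prompt_tokens, context_window=2048, reserved_for_output=256)

print(len(kept))           # 1792 tokens fit (2048 - 256)
print(kept[0], kept[-1])   # tok1208 tok2999: the oldest 1208 tokens were dropped
```

Other policies are possible, such as summarizing the older text instead of dropping it.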

Zero-shot & Few-shot prompting

Zero-shot prompting refers to giving an LLM a task it was not explicitly trained for, without any examples in the prompt, and seeing how it handles it purely based on its existing knowledge. The idea of Few-shot prompting is to give the LLM a few examples (or “shots”) as part of the prompt so that it can use them as context and combine them with its own knowledge to better perform on new tasks. Few-shot prompting is a primary prompt-engineering tactic to increase accuracy.
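The difference is easiest to see in the prompt text itself. The sketch below builds both kinds of prompt for a made-up sentiment task; the `build_prompt` helper, the "Review:/Sentiment:" layout, and the example reviews are assumptions for illustration, and no model is actually called.

```python
def build_prompt(instruction, examples, query):
    """Assemble a prompt: the instruction, any worked examples ("shots"),
    then the new input left open for the model to complete."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Review: {text}", f"Sentiment: {label}", ""]
    lines += [f"Review: {query}", "Sentiment:"]
    return "\n".join(lines)

instruction = "Classify the sentiment of each review as Positive or Negative."
examples = [
    ("Loved it, would buy again.", "Positive"),
    ("Broke after two days.", "Negative"),
]
query = "The battery lasts forever."

zero_shot = build_prompt(instruction, [], query)        # no examples
few_shot = build_prompt(instruction, examples, query)   # two examples

print(zero_shot)
print("-" * 40)
print(few_shot)
```

The few-shot version costs more tokens, but the examples show the model exactly what format and labels are expected.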

Chain-of-Thought prompting

Chain-of-thought (CoT) refers to a prompting technique that instructs the LLM to break down a problem into a series of intermediate reasoning steps, to improve its ability to solve more complex tasks. For example, you may ask the LLM to solve a problem by specifically following the steps you have outlined in the prompt. Alternatively, you may ask it to break down the problem itself and explain it back to you, step-by-step, to encourage the LLM to problem-solve as a human would.

Image source: Kojima et al 2022

Reflexion & Iterative Self-improvement

Reflexion and iterative self-improvement are another set of prompting techniques used to enhance the problem-solving capabilities of LLMs by asking them to review and critique their own outputs, and to correct their answers until they are fully satisfied with the result. This self-guided refinement technique has shown promising results in increasing the problem-solving abilities of LLMs.

Alignment

In the context of LLMs, alignment refers to the process of ensuring that the behavior of the models is in accordance with human values, intentions, and desired outcomes. This means making sure the LLMs are safe, follow human ethical standards, and are robust enough to withstand adversarial attacks, so they aren’t easily derailed by unexpected inputs. Alignment remains one of the most challenging research topics in the space.

Embeddings

In LLMs, embeddings refer to the numerical representation of words and phrases as vectors. Imagine having a dictionary where, instead of word definitions, each word is paired with a list of numbers. These numbers aren't random; they're crafted to capture the meaning and context of each word, which enables LLMs to process the inputs more efficiently. Words with similar meanings or contexts have similar numerical vectors. The same concept applies to images, videos and audio, which can be embedded as vectors as well.


Vector Database

Vector databases are specialized databases designed to handle high-dimensional vectors like word, audio, and image embeddings. They are optimized to store, retrieve and perform similarity search across vectors efficiently. Popular options include the similarity-search library FAISS (Facebook AI Similarity Search) & the managed vector database Pinecone. They are used in various applications such as Gen-AI-powered apps, recommendation systems, and search engines.

Cosine Similarity

When words or sentences are encoded into vectors, the semantic similarity between these pieces of text can be gauged using the cosine similarity metric, a mathematical formula based on the angle between the vectors. A value close to 1 means the vectors point in nearly the same direction and the texts are very similar; a value close to 0 means they are unrelated. For instance, when processing natural language questions, a vector database might utilize cosine similarity to compare the user's input with stored information or embeddings to find the most semantically relevant response or data.
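Here is a minimal sketch of that comparison in plain Python. The three-number "embeddings" are invented for illustration (real embeddings have hundreds or thousands of dimensions and come from a model), and the `store` dictionary with its brute-force `most_similar` lookup stands in for what a vector database does far more efficiently at scale.

```python
import math

def cosine_similarity(a, b):
    """Dot product of the two vectors divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings (made-up numbers).
store = {
    "cat":    [0.90, 0.80, 0.10],
    "kitten": [0.85, 0.90, 0.15],
    "car":    [0.10, 0.20, 0.95],
}

def most_similar(query, store):
    """Return the stored word whose vector is closest in direction to the query."""
    return max(store, key=lambda word: cosine_similarity(query, store[word]))

print(round(cosine_similarity(store["cat"], store["kitten"]), 3))  # close to 1
print(round(cosine_similarity(store["cat"], store["car"]), 3))     # much lower

query = [0.88, 0.82, 0.12]          # embedding of the user's input (assumed)
print(most_similar(query, store))   # cat
```

Because the formula divides by the vector lengths, only the direction of the vectors matters, not their size.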

Agents

In the context of LLMs, agents refer to a system that connects an LLM with other components such as tools and memory, so that it can perform specific tasks and take actions based on the input given by a user. For example, BabyAGI is an LLM-powered task management system that, given an objective from the user, creates, prioritizes, and executes a list of tasks to work toward it.


Tushaar V

Founder & Owner of LangYantra AI | Student at IIT Madras | Generative AI | LangChain | Cloud | GCP | AWS | Azure | Docker | RAG | NLP | LLMs | Web Development | FastAPI | Next.js

8 months ago

Very useful, thanks a lot

Minoo A.

Data & Analytics VP | AI ML DL| Digital Transformation | Data Strategy | Data Governance | Advanced Analytics | Data Platform & Products | Operational Excellence | Culture Creator | Change Catalyst

1 year ago

Enjoyed reading the list. Couple of thoughts for your consideration 1. In "Parameters" I'd add the more complex the model, the more accurate and opaquer it is. 2. You've talked about Input and Output layers, maybe consider covering Hidden Layers. Thanks for the article - it's much needed to level the field and demystify DL.

Jason Knight

Design @ ThoughtSpot | Design Mentor

1 year ago

I agree. It’s funny that we call them large language models because it is like learning a large language…

Andrew von Rosenbach

Manager, Eng. Program Management - Platform @ Cohere AI

1 year ago

Good list! You could add something about serving hardware too. It affects costs and throughput, and is a decision every team adopting LLMs has to deal with, either explicitly (like on SageMaker etc.) or implicitly if you’re just hitting an API from a provider (they’re serving their model on something!)
