How Large Language Models Understand Your Writing (1/2)
Large Language Models (LLMs) are a type of artificial intelligence (AI) program trained on extensive datasets – hence the name “large” – which enables them to recognize, understand, and generate natural language text, among other tasks. Thanks to their significant contributions to advancing generative AI, LLMs have recently gained widespread recognition. They have also become a focal point for organizations looking to integrate artificial intelligence into a variety of business operations and applications.
LLMs learn to comprehend and process human language and other complex data through exposure to massive numbers of examples, often thousands or even millions of gigabytes' worth of text from across the internet. These models leverage deep learning to figure out the relationships between characters, words, and sentences by probabilistically analyzing unstructured data. This allows them to identify different types of content autonomously, without needing direct human guidance.
Whether it is to understand questions, craft responses, classify content, complete sentences, or even translate text into a different language, these AI models can be tailored to solve specific problems in different industries. Just like super readers inside a giant library filled with books, they absorb tons of information to learn how language works. In this article, we will dive into the fascinating world of Large Language Models and explore how they work under the hood.
Major Features of Large Language Models
Large: trained on massive datasets and built from a very large number of parameters.
General purpose: capable of handling a wide range of language tasks rather than a single narrow one.
Pre-trained, fine-tuned, and multimodal: pre-trained on broad data, then fine-tuned for specific tasks, with some variants able to process more than just text.
Core Mechanics of Large Language Models
At the heart of LLMs we encounter the transformer model, which is crucial for understanding how these models operate. Transformers are built with an encoder and a decoder, and they process data by breaking inputs down into tokens. They perform complex mathematical calculations to analyze the relationships between these tokens and use those relationships to produce an output. In essence, the encoder “encodes” the input sequence and passes it to the decoder, which learns how to “decode” the representations for a relevant task.
Transformers enable a computer to recognize patterns in a way that loosely resembles human cognition. These models leverage self-attention mechanisms, which allow them to train more efficiently than older architectures such as long short-term memory (LSTM) models. The self-attention mechanism lets them process each token of a word sequence while considering the context provided by the other words within the same sentence.
To illustrate the concept of self-attention with a practical example, let’s look at how a transformer model tackles the task of translating a sentence. Imagine we are translating the sentence: "The cat sat on the mat."
Input Encoding: The first step involves converting the input sentence into a series of word embeddings. Every word gets turned into a vector that represents its semantic significance in a high-dimensional space. Word embedding effectively captures a word's meaning, ensuring that words positioned closely within the vector space share similar meanings.
Example: [Embedding for 'The', Embedding for 'cat', Embedding for 'sat', Embedding for 'on', Embedding for 'the', Embedding for 'mat', Embedding for '.']
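To make this step concrete, here is a minimal sketch of the embedding lookup in Python with NumPy. The tiny 4-dimensional, randomly initialized embedding table is a placeholder; a real model would use a learned table with hundreds or thousands of dimensions.

```python
import numpy as np

tokens = ["The", "cat", "sat", "on", "the", "mat", "."]
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4))   # (vocab_size, embedding_dim), random placeholder

# Look up one vector per token: shape (7, 4)
embeddings = np.stack([embedding_table[vocab[t]] for t in tokens])
print(embeddings.shape)   # -> (7, 4)
```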
Generating Queries, Keys, and Values: Next, the self-attention mechanism derives three different projections of the input embeddings: queries, keys, and values. These are created by linear transformations of the original embeddings and play a key role in computing the attention scores (a code sketch follows the example values below).
Queries: [Query for 'The', Query for 'cat', Query for 'sat', Query for 'on', Query for 'the', Query for 'mat', Query for '.']
Keys: [Key for 'The', Key for 'cat', Key for 'sat', Key for 'on', Key for 'the', Key for 'mat', Key for '.']
Values: [Value for 'The', Value for 'cat', Value for 'sat', Value for 'on', Value for 'the', Value for 'mat', Value for '.']
Random data:
Queries: [[0.23, 0.40, 0.67, …], [0.40, 0.60, 0.67, …], [0.20, 0.20, 0.67, …], [0.50, 0.30, 0.80, …], [0.10, 0.40, 0.67, …], [0.20, 0.40, 0.67, …], [0.70, 0.40, 0.60, …]]
Keys: [[0.10, 0.40, 0.50, …], [0.20, 0.40, 0.67, …], [0.30, 0.40, 0.67, …], [0.40, 0.40, 0.67, …], [0.50, 0.40, 0.67, …], [0.60, 0.70, 0.80, …], [0.60, 0.40, 0.80, …]]
Values: [[0.40, 0.50, 0.67, …], [0.23, 0.40, 0.50, …], [0.23, 0.40, 0.80, …], [0.23, 0.40, 0.45, …], [0.23, 0.40, 0.90, …], [0.23, 0.40, 0.60, …], [0.23, 0.40, 0.10, …]]
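The sketch below illustrates these linear transformations, assuming the same toy setup of 7 tokens and 4-dimensional embeddings; the weight matrices are random stand-ins for parameters a trained model would have learned.

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(7, 4))   # 7 tokens, embedding_dim = 4 (placeholder values)

W_q = rng.normal(size=(4, 4))          # query projection (random stand-in for learned weights)
W_k = rng.normal(size=(4, 4))          # key projection (random stand-in for learned weights)
W_v = rng.normal(size=(4, 4))          # value projection (random stand-in for learned weights)

queries = embeddings @ W_q             # (7, 4): one query vector per token
keys    = embeddings @ W_k             # (7, 4): one key vector per token
values  = embeddings @ W_v             # (7, 4): one value vector per token
```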
Determining Attention Scores: The process involves computing the dot product between a query and all the keys, which generates the attention scores. These scores indicate the significance or relevance of each word in relation to the word currently under consideration.
Random scores (for illustration):
Attention scores for 'The': [0.90, 0.70, 0.50, 0.40, 0.45, 0.56, 0.23]
Attention scores for 'cat': [0.60, 0.50, 0.70, 0.23, 0.44, 0.58, 0.23]
…
Attention scores for '.': [0.30, 0.50, 0.90, 0.40, 0.45, 0.56, 0.23]
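Here is a minimal sketch of that computation. The query and key matrices are random placeholders, and the dot products are scaled by the square root of the key dimension, as is standard in scaled dot-product attention.

```python
import numpy as np

rng = np.random.default_rng(0)
queries = rng.normal(size=(7, 4))      # placeholder query vectors
keys    = rng.normal(size=(7, 4))      # placeholder key vectors

d_k = keys.shape[-1]
# Dot each query with every key; dividing by sqrt(d_k) keeps the scores in a stable range.
scores = queries @ keys.T / np.sqrt(d_k)   # (7, 7): row i holds the scores for token i
print(scores[0])                           # attention scores for 'The'
```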
Using SoftMax: The attention scores are then passed through the SoftMax function, which transforms them into probabilities. This step guarantees that the attention weights sum to 1, reflecting the relative significance of each word with respect to the word being analyzed.
Example: Softmax of the attention scores for 'The': [0.20, 0.17, 0.14, 0.12, 0.13, 0.14, 0.10]
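A small sketch of the SoftMax step, applied to the illustrative scores for 'The' shown above:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

scores = np.array([0.9, 0.7, 0.5, 0.4, 0.45, 0.56, 0.23])   # scores for 'The' from above
weights = softmax(scores)
print(np.round(weights, 2))   # roughly [0.2, 0.17, 0.14, 0.12, 0.13, 0.14, 0.1]
print(weights.sum())          # 1.0
```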
Calculating Weighted Sum: The attention probabilities obtained from the SoftMax function are then utilized to calculate a weighted sum of the values. This final vector provides a context-sensitive depiction of the current word, factoring in its connections with the rest of the words in the sequence.
Example: Context-aware representation: [0.20 * Value for 'The' + 0.17 * Value for 'cat' + 0.14 * Value for 'sat' + …]
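The weighted-sum step can be sketched as a single matrix product, assuming a row-normalized weight matrix (as SoftMax would produce) and placeholder value vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.random(size=(7, 7))
weights = weights / weights.sum(axis=-1, keepdims=True)   # each row sums to 1, like SoftMax output
values = rng.normal(size=(7, 4))                          # placeholder value vectors

context = weights @ values   # (7, 4): row i is the context-aware representation of token i
print(context[0])            # context-aware vector for 'The'
```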
The resulting representation captures the contextual significance of each word, considering its associations with the other words in the sentence – which enhances the model’s predictive capabilities. After the multiplication with the values, you end up with a 2D matrix of context-aware vectors, one per token. The model’s final layer turns these into a probability distribution over the vocabulary, and the language model can then select the option with the greatest likelihood. This method is known as the “Greedy Approach,” and it leaves little room for creativity because the model consistently opts for the same, most probable word. Alternatively, the language model can sample its choice at random, leading to more creative outcomes.
In the sentence “The cat sat on…”, the next word is most likely going to be “the”. However, if we choose randomly among the other options, we can get something like “bottle” or “plate”, which obviously has a much lower probability. To control the level of creativity, we adjust the temperature parameter, which influences the model’s output. The temperature is a numerical value (often set between 0 and 1, but sometimes higher) that is critical for fine-tuning the model’s behavior. A small sampling sketch, with invented probabilities, follows below.
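The toy snippet below contrasts the two decoding strategies; the candidate words and their probabilities are made up for illustration.

```python
import numpy as np

candidates = ["the", "a", "my", "bottle", "plate"]
probs = np.array([0.70, 0.15, 0.10, 0.03, 0.02])   # invented next-token probabilities

greedy_choice = candidates[int(np.argmax(probs))]   # always picks "the"

rng = np.random.default_rng()
sampled_choice = rng.choice(candidates, p=probs)    # usually "the", occasionally "bottle" or "plate"

print(greedy_choice, sampled_choice)
```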
Adjusting Temperature: This parameter is incorporated directly into the SoftMax function: the scores (logits) are divided by the temperature before the probabilities are computed. In a nutshell, if we want the same safe answers with zero creativity, we decrease the temperature; if we want something fresher and more out-of-the-box, we increase it. The sketch below shows how different temperature values modify the probability distribution of the next word in a sentence.
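A minimal sketch of temperature scaling, using invented logits for the same candidate words as above:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    scaled = np.asarray(logits, dtype=float) / temperature   # divide the logits by the temperature
    scaled = scaled - scaled.max()                           # numerical stability
    e = np.exp(scaled)
    return e / e.sum()

# Invented logits for the candidates "the", "a", "my", "bottle", "plate"
logits = np.array([4.0, 2.5, 2.0, 0.5, 0.1])

for t in (0.2, 1.0, 2.0):
    print(t, np.round(softmax_with_temperature(logits, t), 3))
# Low temperature concentrates probability on the top word ("the");
# high temperature flattens the distribution, so rarer words like "bottle" become likelier.
```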
To wrap up the first part of our exploration into Large Language Models (LLMs), we've seen how these advanced AI systems have revolutionized the way we interact with digital technology. By learning from the vast datasets they've been trained on, LLMs are capable of performing a myriad of tasks spanning the understanding, generation, and translation of language. From their core mechanics, driven by the groundbreaking transformer model, to their ability to tailor responses to context, LLMs embody a significant leap towards more intuitive and efficient human-computer interactions.
As we've delved into the intricacies of how these models process and generate language, it's clear that the potential applications are as vast as the data they learn from. However, the journey doesn't end here. In the following part of our series, we will navigate through the different types of Large Language Models, highlighting their unique capabilities, challenges, and the future they're shaping in various industries.
Stay tuned as we continue to explore the remarkable world of LLMs and their impact on our digital lives.