How Large Language Models Understand Your Writing (1/2)
Large Language Models (LLMs) are a type of artificial intelligence (AI) program trained on extensive datasets – hence the name “large” – which enables them to recognize, understand, and generate natural language text, among other tasks. Thanks to their significant contributions to advancing generative AI, LLMs have recently gained widespread recognition. They have also become a focal point for organizations looking to integrate artificial intelligence into a variety of business operations and applications.
LLMs learn to comprehend and process human language and other complex data through exposure to massive numbers of examples, often thousands or even millions of gigabytes' worth of text from across the internet. These models leverage deep learning to figure out the relationships between characters, words, and sentences by probabilistically analyzing unstructured data. This allows them to identify different types of content autonomously, without needing direct human guidance.
Whether it is to understand questions, craft responses, classify content, complete sentences, or even translate text into a different language, these AI models can be tailored to solve specific problems in different industries. Just like super readers inside a giant library filled with books, they absorb tons of information to learn how language works. In this article, we will dive into the fascinating world of Large Language Models and explore how they work under the hood.
Major Features of Large Language Models
Large: trained on massive datasets and built from a very large number of parameters.
General purpose: capable of handling a wide range of language tasks rather than a single narrow one.
Pre-trained, fine-tuned, and multimodal: pre-trained on broad data, then fine-tuned for specific tasks, with some variants able to process more than just text.
Core Mechanics of Large Language Models
At the heart of LLMs we encounter the transformer model, which is crucial for understanding how these models operate. Transformers are built with an encoder and a decoder, and they process data by breaking inputs down into tokens. They perform complex mathematical calculations to analyze the relationships between these tokens and use those relationships to produce an output. In essence, the encoder “encodes” the input sequence and passes it to the decoder, which learns how to “decode” the representations for a relevant task.
Transformers enable a computer to recognize patterns in a way that loosely resembles human cognition. These models leverage self-attention mechanisms, which allow them to train more efficiently than older architectures such as long short-term memory (LSTM) models. The self-attention mechanism lets them process each token of a word sequence while considering the context provided by the other words within the same sentence.
To illustrate the concept of self-attention with a practical example, let’s look at how a transformer model tackles the task of translating a sentence. Imagine we are translating the sentence: "The cat sat on the mat."
Input Encoding: The first step involves converting the input sentence into a series of word embeddings. Every word gets turned into a vector that represents its semantic significance in a high-dimensional space. Word embedding effectively captures a word's meaning, ensuring that words positioned closely within the vector space share similar meanings.
Example: [Embedding for 'The', Embedding for 'cat', Embedding for 'sat', Embedding for 'on', Embedding for 'the', Embedding for 'mat', Embedding for '.']
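To make this step concrete, here is a minimal sketch of the embedding lookup in Python with NumPy. The tiny 4-dimensional, randomly initialized embedding table is a placeholder; a real model would use a learned table with hundreds or thousands of dimensions.

```python
import numpy as np

tokens = ["The", "cat", "sat", "on", "the", "mat", "."]
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4))   # (vocab_size, embedding_dim), random placeholder

# Look up one vector per token: shape (7, 4)
embeddings = np.stack([embedding_table[vocab[t]] for t in tokens])
print(embeddings.shape)   # -> (7, 4)
```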
Generating Queries, Keys, and Values: Next, the self-attention mechanism derives three different projections of the input embeddings: queries, keys, and values. These are created by linear transformations of the original embeddings and play a key role in computing the attention scores (a code sketch follows the example values below).
Queries: [Query for 'The', Query for 'cat', Query for 'sat', Query for 'on', Query for 'the', Query for 'mat', Query for '.']
Keys: [Key for 'The', Key for 'cat', Key for 'sat', Key for 'on', Key for 'the', Key for 'mat', Key for '.']
Values: [Value for 'The', Value for 'cat', Value for 'sat', Value for 'on', Value for 'the', Value for 'mat', Value for '.']
Random data:
Queries: [[0.23, 0.40, 0.67, …], [0.40, 0.60, 0.67, …], [0.20, 0.20, 0.67, …], [0.50, 0.30, 0.80, …], [0.10, 0.40, 0.67, …], [0.20, 0.40, 0.67, …], [0.70, 0.40, 0.60, …]]
Keys: [[0.10, 0.40, 0.50, …], [0.20, 0.40, 0.67, …], [0.30, 0.40, 0.67, …], [0.40, 0.40, 0.67, …], [0.50, 0.40, 0.67, …], [0.60, 0.70, 0.80, …], [0.60, 0.40, 0.80, …]]
Values: [[0.40, 0.50, 0.67, …], [0.23, 0.40, 0.50, …], [0.23, 0.40, 0.80, …], [0.23, 0.40, 0.45, …], [0.23, 0.40, 0.90, …], [0.23, 0.40, 0.60, …], [0.23, 0.40, 0.10, …]]
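The sketch below illustrates these linear transformations, assuming the same toy setup of 7 tokens and 4-dimensional embeddings; the weight matrices are random stand-ins for parameters a trained model would have learned.

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(7, 4))   # 7 tokens, embedding_dim = 4 (placeholder values)

W_q = rng.normal(size=(4, 4))          # query projection (random stand-in for learned weights)
W_k = rng.normal(size=(4, 4))          # key projection (random stand-in for learned weights)
W_v = rng.normal(size=(4, 4))          # value projection (random stand-in for learned weights)

queries = embeddings @ W_q             # (7, 4): one query vector per token
keys    = embeddings @ W_k             # (7, 4): one key vector per token
values  = embeddings @ W_v             # (7, 4): one value vector per token
```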
Determining Attention Scores: The process involves computing the dot product between a query and all the keys, which generates the attention scores. These scores indicate the significance or relevance of each word in relation to the word currently under consideration.
Random scores (for illustration):
Attention scores for 'The': [0.90, 0.70, 0.50, 0.40, 0.45, 0.56, 0.23]
Attention scores for 'cat': [0.60, 0.50, 0.70, 0.23, 0.44, 0.58, 0.23]
…
Attention scores for '.': [0.30, 0.50, 0.90, 0.40, 0.45, 0.56, 0.23]
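Here is a minimal sketch of that computation. The query and key matrices are random placeholders, and the dot products are scaled by the square root of the key dimension, as is standard in scaled dot-product attention.

```python
import numpy as np

rng = np.random.default_rng(0)
queries = rng.normal(size=(7, 4))      # placeholder query vectors
keys    = rng.normal(size=(7, 4))      # placeholder key vectors

d_k = keys.shape[-1]
# Dot each query with every key; dividing by sqrt(d_k) keeps the scores in a stable range.
scores = queries @ keys.T / np.sqrt(d_k)   # (7, 7): row i holds the scores for token i
print(scores[0])                           # attention scores for 'The'
```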
Using SoftMax: The attention scores are then passed through the SoftMax function, which transforms them into probabilities. This step guarantees that the attention weights sum to 1, reflecting the relative significance of each word with respect to the word being analyzed.
Example: Softmax of the attention scores for 'The': [0.20, 0.17, 0.14, 0.12, 0.13, 0.14, 0.10]
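A small sketch of the SoftMax step, applied to the illustrative scores for 'The' shown above:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

scores = np.array([0.9, 0.7, 0.5, 0.4, 0.45, 0.56, 0.23])   # scores for 'The' from above
weights = softmax(scores)
print(np.round(weights, 2))   # roughly [0.2, 0.17, 0.14, 0.12, 0.13, 0.14, 0.1]
print(weights.sum())          # 1.0
```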
Calculating Weighted Sum: The attention probabilities obtained from the SoftMax function are then utilized to calculate a weighted sum of the values. This final vector provides a context-sensitive depiction of the current word, factoring in its connections with the rest of the words in the sequence.
Example: Context-aware representation: [0.20 * Value for 'The' + 0.17 * Value for 'cat' + 0.14 * Value for 'sat' + …]
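The weighted-sum step can be sketched as a single matrix product, assuming a row-normalized weight matrix (as SoftMax would produce) and placeholder value vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.random(size=(7, 7))
weights = weights / weights.sum(axis=-1, keepdims=True)   # each row sums to 1, like SoftMax output
values = rng.normal(size=(7, 4))                          # placeholder value vectors

context = weights @ values   # (7, 4): row i is the context-aware representation of token i
print(context[0])            # context-aware vector for 'The'
```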
The resulting representation captures the contextual significance of each word, considering its associations with the other words in the sentence – which enhances the model’s predictive capabilities. After the multiplication with the values, you end up with a 2D matrix of context-aware vectors, one per token. The model’s final layer turns these into a probability distribution over the vocabulary, and the language model can then select the option with the greatest likelihood. This method is known as the “Greedy Approach,” and it leaves little room for creativity because the model consistently opts for the same, most probable word. Alternatively, the language model can sample its choice at random, leading to more creative outcomes.
In the sentence “The cat sat on…”, the next word is most likely going to be “the”. However, if we choose randomly among the other options, we can get something like “bottle” or “plate”, which obviously has a much lower probability. To control the level of creativity, we adjust the temperature parameter, which influences the model’s output. The temperature is a numerical value (often set between 0 and 1, but sometimes higher) that is critical for fine-tuning the model’s behavior. A small sampling sketch, with invented probabilities, follows below.
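The toy snippet below contrasts the two decoding strategies; the candidate words and their probabilities are made up for illustration.

```python
import numpy as np

candidates = ["the", "a", "my", "bottle", "plate"]
probs = np.array([0.70, 0.15, 0.10, 0.03, 0.02])   # invented next-token probabilities

greedy_choice = candidates[int(np.argmax(probs))]   # always picks "the"

rng = np.random.default_rng()
sampled_choice = rng.choice(candidates, p=probs)    # usually "the", occasionally "bottle" or "plate"

print(greedy_choice, sampled_choice)
```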
Adjusting Temperature: This parameter is incorporated directly into the SoftMax function: the scores (logits) are divided by the temperature before the probabilities are computed. In a nutshell, if we want the same safe answers with zero creativity, we decrease the temperature; if we want something fresher and more out-of-the-box, we increase it. The sketch below shows how different temperature values modify the probability distribution of the next word in a sentence.
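A minimal sketch of temperature scaling, using invented logits for the same candidate words as above:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    scaled = np.asarray(logits, dtype=float) / temperature   # divide the logits by the temperature
    scaled = scaled - scaled.max()                           # numerical stability
    e = np.exp(scaled)
    return e / e.sum()

# Invented logits for the candidates "the", "a", "my", "bottle", "plate"
logits = np.array([4.0, 2.5, 2.0, 0.5, 0.1])

for t in (0.2, 1.0, 2.0):
    print(t, np.round(softmax_with_temperature(logits, t), 3))
# Low temperature concentrates probability on the top word ("the");
# high temperature flattens the distribution, so rarer words like "bottle" become likelier.
```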
To wrap up the first part of our exploration into Large Language Models (LLMs), we've seen how these advanced AI systems have revolutionized the way we interact with digital technology. By learning from the vast datasets they've been trained on, LLMs are capable of performing a myriad of tasks spanning the understanding, generation, and translation of language. From their core mechanics, driven by the groundbreaking transformer model, to their ability to tailor responses to context, LLMs embody a significant leap towards more intuitive and efficient human-computer interactions.
As we've delved into the intricacies of how these models process and generate language, it's clear that the potential applications are as vast as the data they learn from. However, the journey doesn't end here. In the following part of our series, we will navigate through the different types of Large Language Models, highlighting their unique capabilities, challenges, and the future they're shaping in various industries.
Stay tuned as we continue to explore the remarkable world of LLMs and their impact on our digital lives.