#4: Generative AI and Language Models
Carlos Cesar Martins Ferreira
Oldest son | Brother | Husband | Father (of two) | M.Sc. | Ph.D. Candidate | Above all, optimistic and passionate about life
Introduction
Generative AI is increasingly used across many sectors of society, mainly through Large Language Models (LLMs) and Small Language Models (SLMs). These models can generate content such as text and images and can automate a range of tasks. Examples include audio data analysis, customer support, analysis of customer sentiment towards companies, education and training, cybersecurity, and even software and application development, thanks to their ability to generate code in different programming languages. This article introduces some basic and useful concepts regarding Generative AI, LLMs and SLMs.
What is Generative AI?
Generative AI is a class of artificial intelligence systems designed to generate new content based on patterns learned from existing data. Unlike traditional AI, which typically classifies or predicts data, generative AI creates novel outputs that can include text, images, music, and other forms of media. These models learn from vast datasets and use that knowledge to produce original and coherent pieces of content.
How does Generative AI work?
Generative AI uses advanced machine learning techniques, particularly neural networks, to analyse and understand large datasets. The process involves several key steps:
What are Large Language Models (LLM)?
Large Language Models (LLMs) are a subset of generative AI focused on understanding and generating human language. They are characterised by their large number of parameters, which enable them to capture intricate details of language. LLMs are trained on extensive text corpora and can perform various language tasks, such as translation, summarisation, and conversation. Examples include OpenAI's GPT-3 and Google's Gemma.
What are Small Language Models (SLM)?
Small Language Models (SLMs) are scaled-down versions of LLMs designed to perform specific language tasks with fewer resources. While they are less powerful and versatile than LLMs, SLMs are often more efficient and can be tailored for particular applications where the extensive capabilities of an LLM are unnecessary. They are typically used in environments with limited computational resources or where a smaller model can accomplish a focused task. Examples include Microsoft's Phi-3 and Meta's Llama 3. Some differences between LLM and SLM are:
What is a Foundation Model?
A Foundation Model is a large, pre-trained AI model that serves as a base for further customisation and fine-tuning for specific tasks. These models are trained on broad datasets and can be adapted to various applications with minimal additional training. Foundation Models leverage their extensive pre-training to provide a robust starting point for developing task-specific AI systems, significantly reducing the time and resources required for model development.
What is a hallucination in the context of Generative AI?
In generative AI, a "hallucination" refers to the phenomenon where the AI generates content that is plausible but factually incorrect or nonsensical. This occurs because the AI creates outputs based on patterns and probabilities from the training data without understanding the factual accuracy of the information. Hallucinations can be problematic in applications where accuracy and credibility are critical, such as in medical or legal contexts. Addressing hallucinations involves refining the training data, implementing verification mechanisms, and incorporating human oversight.
LLM and SLM
Google, Meta, Microsoft and OpenAI are among the leading technology companies in LLMs and SLMs. Below are non-exhaustive examples of LLMs and SLMs developed by different companies, together with the number of parameters each model contains.
Parameters and hyperparameters of language models
The parameters represent the connections within the neural network and define how input data is transformed into output data. These parameters can be of two types: weights and biases.
In a neural network, weights are the values that adjust the input signal. During training, these weights are tuned so that the model can make accurate predictions. Each connection between neurons (nodes) has an associated weight.
Biases are additional parameters that allow the model to shift the activation function, which helps the model fit the data better. Each neuron has its own bias value.
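As a minimal sketch of how weights and biases act together, the snippet below computes the output of a single neuron. The input values, the weights, and the choice of a sigmoid activation are assumptions made for illustration only.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, passed through a sigmoid activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

inputs = [0.5, -1.0]
weights = [2.0, 1.0]   # one weight per incoming connection

# The weighted sum is 2.0*0.5 + 1.0*(-1.0) = 0.0
out_no_bias = neuron(inputs, weights, bias=0.0)    # sigmoid(0.0) = 0.5
out_with_bias = neuron(inputs, weights, bias=2.0)  # sigmoid(2.0), about 0.88
print(out_no_bias, out_with_bias)
```

With the same inputs and weights, changing only the bias moves the output from 0.5 to about 0.88, which is the "shift" of the activation described above.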
You can check this article to understand better, at a basic level, how weights and biases mathematically work.
Therefore, saying that a model has 7 billion parameters (like Gemma) means that the model contains roughly 7 billion tunable weights and biases that are adjusted during training.
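To make parameter counting concrete, here is a small sketch that counts the weights and biases of a fully connected network. The function name `count_parameters` and the layer sizes are hypothetical. Real language models are built from Transformer layers, whose parameter counts follow different formulas, so this only illustrates the principle.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of neurons per layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = n_in * n_out   # one weight per connection between the two layers
        biases = n_out           # one bias per neuron in the receiving layer
        total += weights + biases
    return total

# Hypothetical toy network: 4 inputs, 8 hidden neurons, 2 outputs
toy_total = count_parameters([4, 8, 2])
print(toy_total)  # (4*8 + 8) + (8*2 + 2) = 58
```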
However, besides the parameters, which are initialised randomly or using specific schemes and then learned during training, there are other settings that are chosen manually to control how the model is trained and how efficiently it runs. These are called hyperparameters. Some of the hyperparameters are:
Training of Language Models
Training LLMs and SLMs requires significant computational resources, including high-performance GPUs or TPUs and large amounts of memory. The training process involves four stages: gathering and preprocessing a large corpus of text data (data preparation); designing the neural network architecture, such as the Transformer Model used in many LLMs (model architecture); running the model through many epochs of training, in which it learns to minimise the loss function through backpropagation and parameter updates (training); and continuously evaluating the model on a separate validation dataset to ensure it generalises well to new, unseen data (validation).
At the start of the training process, the parameters are typically initialised randomly or using specific schemes. This is the starting point for the learning process. Over many iterations, the model then adjusts these parameters. The adjustment is guided by an optimisation algorithm (e.g., stochastic gradient descent) that minimises the loss function, which measures the difference between the model’s predictions and the actual data.
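The loop described above can be sketched on a toy problem. The example below fits a model with just two parameters, `w` and `b`, to made-up data using plain (full-batch) gradient descent rather than the stochastic variant. The data, the learning rate, and the number of steps are assumptions chosen for illustration.

```python
# Toy data generated from y = 2x + 1 (an assumption for this sketch)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

def mse(w, b):
    """Loss function: mean squared difference between predictions and targets."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

w, b = 0.0, 0.0        # fixed start for reproducibility; real models use random initialisation
learning_rate = 0.05   # a hyperparameter chosen by hand
initial_loss = mse(w, b)

for step in range(2000):
    # Gradients of the loss with respect to each parameter
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # Move each parameter a small step against its gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = mse(w, b)
print(round(w, 3), round(b, 3), final_loss < initial_loss)
```

After training, `w` and `b` should be close to 2 and 1, the values used to generate the data. Language models follow the same principle with billions of parameters, with gradients computed by backpropagation.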
The training data shape the parameters: the patterns, relationships, and structures that the model learns from the data are captured in the parameter values. For example, a language model trained on a vast corpus of text data learns the statistical properties of the language, such as grammar, semantics, and context.
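A very simplified way to see what "learning the statistical properties of the language" means is to count which word tends to follow which. The sketch below builds such a table (a bigram model) from a tiny made-up corpus. This is not how neural language models work internally, since they store what they learn in weights rather than counts, but it illustrates how patterns in the data turn into next-word probabilities.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; a real model would see billions of words
corpus = "the cat sat on the mat . the cat ate the fish ."
tokens = corpus.split()

# Count how often each word follows each other word
follow_counts = defaultdict(Counter)
for current, nxt in zip(tokens, tokens[1:]):
    follow_counts[current][nxt] += 1

def next_word_probs(word):
    """Estimated probability of each word that followed `word` in the corpus."""
    counts = follow_counts[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

probs_after_the = next_word_probs("the")
print(probs_after_the)
```

In this corpus, "the" is followed by "cat" twice, and by "mat" and "fish" once each, so "cat" receives a probability of 0.5 and the other two 0.25.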
Capacity and performance of language models
Parameters matter because they are closely tied to the capacity and performance of a model. The number of parameters often correlates with the model’s capacity to learn complex patterns. More parameters can enable the model to capture more intricate details and nuances in the data. Models with more parameters also tend to perform better on tasks like language understanding and generation, as they have more flexibility to fit the training data.
Other useful terms in the context of language models and beyond
Attention Mechanism – A component of neural networks that allows the model to focus on specific parts of the input sequence when generating output. In the context of language models, it helps the model to weigh the importance of different words in a sentence, enabling it to capture long-range dependencies and improve performance on tasks like translation and text generation.
Backpropagation – Training algorithm for neural networks where the error is calculated and propagated backwards through the network to update the weights. This process helps minimise the loss function by adjusting the weights based on the gradient of the loss function with respect to each weight.
Corpus – A large and structured set of texts used for training language models. It is the primary data source from which the model learns linguistic patterns, vocabulary, grammar, and context.
Embedding – Numerical representation of words or phrases in a continuous vector space. This representation captures semantic relationships between words, allowing the model to process and understand language more effectively. Word embeddings map similar words to similar vector representations.
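As an illustrative sketch of the Embedding entry, the snippet below compares word vectors using cosine similarity, a common measure of how closely two vectors point in the same direction. The three-dimensional vectors are hand-written assumptions. Real embeddings are learned during training and usually have hundreds or thousands of dimensions.

```python
import math

# Hand-written vectors, purely illustrative; real embeddings are learned
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means very similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_king_apple = cosine_similarity(embeddings["king"], embeddings["apple"])
print(sim_king_queen > sim_king_apple)
```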
Fine-Tuning – Process of taking a pre-trained model and adjusting its parameters on a smaller, task-specific dataset. This allows the model to adapt to specific applications and improve performance on particular tasks, leveraging the general knowledge gained during pre-training.
GPU (Graphics Processing Unit) – Specialised hardware device designed for parallel processing, which accelerates the training and inference of neural networks. GPUs are particularly effective for handling the large-scale computations required by deep learning models.
Loss Function – Measures the difference between the model's predictions and the actual values. It quantifies the model's performance and guides the optimisation process during training: it is the quantity that the optimiser minimises, using gradients computed through backpropagation.
Neural Network Architecture – Refers to the structure and design of the layers and connections within a neural network. This includes the number and types of layers (e.g., convolutional, recurrent, transformer), the connectivity pattern, and other hyperparameters that define the model's complexity and capabilities.
Perplexity – Metric used to evaluate language models. It measures how well a model predicts a sample and is defined as the exponential of the average negative log-likelihood per token of the test set. Lower perplexity indicates better performance, meaning the model assigns higher probability to the text it actually observes.
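As a small worked example of perplexity, the sketch below computes it as the exponential of the average negative log-probability that a model assigns to each observed token. The two lists of probabilities are hypothetical values, not the output of any real model.

```python
import math

def perplexity(token_probs):
    """Exponential of the average negative log-probability of the observed tokens."""
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# Hypothetical probabilities two models assigned to the same four test tokens
confident_model = [0.5, 0.5, 0.5, 0.5]
uncertain_model = [0.1, 0.1, 0.1, 0.1]

ppl_confident = perplexity(confident_model)
ppl_uncertain = perplexity(uncertain_model)
print(ppl_confident, ppl_uncertain)
```

The first model has a perplexity of about 2 and the second about 10. One way to read this is that the first model is, on average, as uncertain as if it were choosing between 2 equally likely tokens, and the second between 10.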
Pre-Training – Process of training a model on a large, general-purpose dataset before fine-tuning it on a smaller, task-specific dataset. This approach allows the model to learn a broad set of features and patterns that can be adapted to various tasks through fine-tuning.
Zero-shot Learning – Ability of a model to perform tasks it has not been explicitly trained on. The model leverages its general knowledge to handle unseen data or tasks by understanding the relationships between known and unknown concepts.
Tokenisation – Process of breaking down the text into smaller units called tokens, which can be words, subwords, or characters. These tokens are the basic input units for language models, allowing them to process and generate text.
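The snippet below sketches word-level and character-level tokenisation and then maps each token to an integer ID, which is the form a model actually receives. The regular expression and the way the vocabulary is built are simplifying assumptions. Production language models usually rely on learned subword schemes instead.

```python
import re

def word_tokens(text):
    """Split text into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokens(text):
    """Split text into individual characters."""
    return list(text)

sentence = "Models generate text."
words = word_tokens(sentence)   # ['Models', 'generate', 'text', '.']
chars = char_tokens("text")     # ['t', 'e', 'x', 't']

# Map each distinct token to an integer ID, as a model's vocabulary would
vocab = {token: idx for idx, token in enumerate(sorted(set(words)))}
ids = [vocab[token] for token in words]
print(words, chars, ids)
```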
TPU (Tensor Processing Unit) – Specialised hardware accelerator designed by Google specifically for running machine learning workloads. TPUs are optimised for high-performance computation and efficient training of large-scale neural networks.
Transformer Model – A neural network architecture that uses self-attention mechanisms to process input sequences in parallel rather than sequentially. It has become the foundation for many state-of-the-art language models due to its efficiency and ability to capture complex dependencies in the data.
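To illustrate the Attention Mechanism and Transformer Model entries, here is a minimal sketch of scaled dot-product attention for a single query, written in plain Python. The vectors are made up, and the sketch leaves out the learned projection matrices and multiple attention heads that real Transformer models use.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    # Similarity between the query and every key, scaled by sqrt(dimension)
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Output is the weighted average of the value vectors
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Made-up 2-dimensional vectors for a three-token sequence
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]

attn_weights, attn_output = attention(query, keys, values)
print(attn_weights, attn_output)
```

The query matches the first and third keys more closely than the second, so those two tokens receive larger attention weights and contribute more to the output. This is the sense in which the model "weighs the importance" of different tokens.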