Introduction to Generative AI - Part II

In the previous article, Introduction to Generative AI - Part I, we learned about foundation models and their lifecycle.

In this article, we will learn about the different types of foundation models.

Types of Foundation Models

To grasp the potential of Generative AI, it is crucial to understand the different foundation models that underpin this technology.

Large language models

While large language models (LLMs) can be built on various architectural frameworks, the transformer architecture has emerged as the predominant choice for cutting-edge models currently in use. These transformer-based LLMs are powerful tools capable of comprehending and generating text that closely resembles human language. Their training process involves ingesting massive volumes of textual data sourced from the internet, books, and other resources, enabling them to discern patterns and relationships between words and phrases.

To gain a deeper understanding of how LLMs operate, explore the following sections, which delve into the concepts of tokens, embeddings, and vectors – essential components that underpin these models.

Tokens: Tokens are the fundamental building blocks that large language models (LLMs) use to represent and process text data. In the context of LLMs, tokens can be thought of as the individual units or pieces that make up the model's vocabulary.

As an example, the sentence "I want to learn Generative AI" might be broken up into the following tokens: ["I", " want", " to", " learn", " Generative", " AI"]
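To see tokenization in practice, here is a minimal sketch using the Hugging Face transformers library (an assumption on my part; the article does not prescribe any particular tokenizer, and the exact tokens produced depend on which one you use):

```python
# Minimal tokenization sketch (assumes the "transformers" package is installed).
# The exact sub-word pieces depend on the tokenizer; GPT-2's BPE tokenizer is used here as an example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

sentence = "I want to learn Generative AI"
tokens = tokenizer.tokenize(sentence)    # sub-word pieces, e.g. "Generative" may be split further
token_ids = tokenizer.encode(sentence)   # the integer IDs the model actually consumes

print(tokens)
print(token_ids)
```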

Embeddings and Vectors: Embeddings are numerical encodings that represent tokens, where each token is associated with a vector, essentially a sequence of numbers. These vectors are designed to encapsulate the semantic meaning and interconnections of tokens with other tokens within the vocabulary. The embeddings are derived through the training process, during which the model learns to capture the contextual nuances and intricate relationships inherent in natural language. This acquired understanding enables the model to comprehend the underlying meaning and implications of tokens within a given context.

As an illustration, the embedding vector corresponding to the token "China" might be situated in close proximity to the vectors representing "Russia" and "Japan" within the embedding space. This spatial closeness signifies that these tokens share semantic similarities and are contextually related. Through this embedding-based representation, the model can likewise infer that "Beijing" bears resemblance to "Moscow" and "Tokyo", without explicitly being programmed with such direct associations. This capacity emerges from the model's ability to discern and encode the intricate relationships between tokens during the training process.

An illustration of the spatial properties of word2vec.
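To make "closeness in embedding space" concrete, here is a small sketch that computes cosine similarity between vectors. The vectors below are made-up toy values chosen only for illustration; they are not real embeddings learned by any model:

```python
# Toy illustration of embedding similarity (the vectors are invented for this example).
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: close to 1.0 = similar direction, lower = less related."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional embeddings; real models use hundreds or thousands of dimensions.
embeddings = {
    "China":  np.array([0.9, 0.1, 0.3, 0.7]),
    "Russia": np.array([0.8, 0.2, 0.4, 0.6]),
    "banana": np.array([0.1, 0.9, 0.8, 0.1]),
}

print(cosine_similarity(embeddings["China"], embeddings["Russia"]))  # relatively high
print(cosine_similarity(embeddings["China"], embeddings["banana"]))  # relatively low
```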

Large language models (LLMs) leverage tokens, embeddings, and vectors as the fundamental building blocks to comprehend and generate text. These models possess the capability to discern and encode intricate relationships within language, enabling them to produce coherent and contextually relevant text outputs. Their abilities extend beyond mere text generation, encompassing question answering, information summarization, and even engaging in creative writing tasks. By harnessing the rich representations captured through embeddings and vectors, LLMs can navigate the nuances and complexities of language, facilitating a wide range of natural language processing tasks with remarkable fluency and aptitude.

A few examples of LLMs: GPT-3 (Generative Pre-trained Transformer 3), BERT (Bidirectional Encoder Representations from Transformers), T5 (Text-to-Text Transfer Transformer), etc.
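As a quick, hedged example of putting an LLM to work, the snippet below uses the Hugging Face transformers text-generation pipeline with GPT-2, a small model that can run locally (the larger models named above are typically accessed through hosted APIs instead):

```python
# Minimal text-generation sketch (assumes "transformers" and a backend such as PyTorch are installed).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

result = generator("Generative AI is", max_new_tokens=20, num_return_sequences=1)
print(result[0]["generated_text"])
```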

We will dig deeper into LLMs in upcoming articles.

Diffusion models

Diffusion models represent a groundbreaking deep learning architecture that takes an unconventional approach to generating outputs. Unlike traditional methods, these models initiate the process with pure noise or random data. Through an iterative process, they gradually infuse this noise with increasing amounts of meaningful information, ultimately culminating in a clear and coherent output, such as an image or a piece of text.

The learning process for diffusion models unfolds in two distinct phases: forward diffusion and reverse diffusion.

During forward diffusion, the model introduces noise into the input data, progressively obscuring its clarity.

Image taken from: Calibrant

Conversely, in the reverse diffusion phase, the model learns to reconstruct the original data by systematically removing the added noise, effectively reversing the forward diffusion process.

Image taken from: Calibrant
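To connect the two phases, here is a toy NumPy sketch: the forward function corrupts a clean sample using the closed-form expression x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise, and the reverse function takes one DDPM-style denoising step given a noise prediction. The linear noise schedule is an illustrative assumption, and predicted_noise stands in for the output of a trained model, which is not implemented here:

```python
# Toy diffusion sketch (NumPy only): closed-form forward noising and one reverse denoising step.
# The linear beta schedule is an illustrative assumption; real models learn to predict the noise.
import numpy as np

rng = np.random.default_rng(0)

T = 1000                                    # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)          # per-step noise amounts
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)             # cumulative products (alpha_bar_t)

def forward_diffuse(x0, t):
    """Forward diffusion: sample x_t directly from x_0 by mixing in Gaussian noise."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

def reverse_step(x_t, t, predicted_noise):
    """Reverse diffusion: one DDPM-style step from x_t towards x_{t-1}.

    `predicted_noise` is a placeholder for what a trained neural network would output.
    """
    coef = (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t])
    mean = (x_t - coef * predicted_noise) / np.sqrt(alphas[t])
    if t == 0:
        return mean                         # no extra noise is added at the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

x0 = rng.standard_normal((8, 8))            # stand-in for a tiny "image"
print(np.abs(forward_diffuse(x0, 10)).mean())   # early step: still close to the data
print(np.abs(forward_diffuse(x0, 999)).mean())  # final step: essentially pure noise
```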

This innovative approach has demonstrated remarkable success in various domains, including image synthesis, text generation, and potentially many other applications yet to be explored. By starting from a state of complete noise and gradually refining the output, diffusion models offer a unique perspective on the generative modeling paradigm, paving the way for exciting advancements in artificial intelligence.

A few examples of diffusion models: DALL-E 2, Stable Diffusion, Imagen, CLIP-Guided Diffusion, etc.
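For a sense of what using one of these models looks like in practice, here is a sketch with the Hugging Face diffusers library; treat the model identifier and exact arguments as assumptions to verify against the library's documentation, since they change between releases:

```python
# Sketch of text-to-image generation with Stable Diffusion via the "diffusers" library.
# Requires diffusers, transformers and PyTorch; the model ID below is an assumption to verify.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")                      # a GPU is strongly recommended

image = pipe("a watercolor painting of a lighthouse at sunrise").images[0]
image.save("lighthouse.png")
```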

We will dig deeper into diffusion models in upcoming articles.

Multimodal models

Transcending the boundaries of traditional models that solely operate on a single data modality, such as text or images, multimodal models herald a new era of versatility. These cutting-edge models possess the remarkable ability to seamlessly process and generate multiple modes of data simultaneously, unlocking a realm of innovative possibilities.

As an illustration, consider a multimodal model that can ingest both an image and accompanying text as input, subsequently generating a novel image along with a descriptive caption as output. This remarkable capability stems from the model's profound understanding of the intricate connections and interdependencies between different modalities like images and text, enabling them to influence and inform one another.

Image by Hariri Walid
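One simple way to see image-text alignment in practice is with CLIP, a model trained to score how well a caption matches an image. The sketch below uses the transformers CLIP classes; the model name is a common public checkpoint, and the image path is a placeholder to replace with any local image:

```python
# Sketch: scoring how well candidate captions match an image with CLIP (via "transformers").
# Assumes transformers, PyTorch and Pillow are installed; "example.jpg" is a placeholder path.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")           # replace with a path to a real image
captions = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)   # probability that each caption matches the image
print(dict(zip(captions, probs[0].tolist())))
```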

The applications of multimodal models are vast and diverse, ranging from automated video captioning to the generation of graphics based on textual instructions. Furthermore, these models can enhance question-answering systems by intelligently combining textual and visual information, providing more comprehensive and contextualized responses. They even hold the potential to revolutionize content translation, preserving the integrity of relevant visuals while accurately translating the accompanying text.

As we embark on this new frontier of multimodal modeling, we are poised to witness a paradigm shift in how we interact with and leverage information across various modalities, paving the way for unprecedented levels of automation, creativity, and seamless integration of diverse data forms.

A few examples of multimodal models: ALBEF (Align Before Fuse), ViLT (Vision-and-Language Transformer), ViLBERT (Vision-and-Language BERT), etc.

We will dig deeper into multimodal models in upcoming articles.

Other generative models

There are several other types of generative models used in ML and AI, but the ones described above are the most commonly used today. Two further families worth knowing are outlined below.

Generative Adversarial Networks (GANs) are a revolutionary class of generative models that employ two competing neural networks in a strategic adversarial framework, akin to a zero-sum game. This intricate interplay involves two distinct networks: the generator and the discriminator.

The Generator Network: This network acts as an artificial creative force, taking random noise as input and transforming it into synthetic data that closely mimics the characteristics of the training data distribution. The generated data can take various forms, such as images, text, or audio, depending on the specific application.

The Discriminator Network: This network plays the role of a discerning critic, tasked with analyzing both the real data from the training set and the synthetic data generated by the generator. Its objective is to accurately distinguish between the genuine data and the artificially generated samples.

Image from: paperswithcode

During the training process, an intricate dance unfolds as the generator continuously strives to generate data that can convincingly deceive the discriminator into mistaking it for real data. Simultaneously, the discriminator relentlessly hones its ability to correctly classify the real and generated data.

This adversarial dynamic persists until a remarkable equilibrium is reached, where the generator produces data so authentic and indistinguishable from the real data that even the discriminator, with its finely tuned senses, is unable to differentiate between the two.

Through this iterative process of competition and refinement, GANs unlock the potential to generate highly realistic and diverse data, pushing the boundaries of what is achievable in various domains, from computer vision to natural language processing and beyond.
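To make the adversarial dynamic concrete, here is a heavily simplified PyTorch sketch that trains a generator and discriminator on toy 1-D data. The network sizes, learning rate and data distribution are all illustrative assumptions; real GANs for images use convolutional architectures and many stabilization tricks:

```python
# Minimal GAN sketch in PyTorch: a generator and discriminator trained on toy 1-D Gaussian data.
import torch
import torch.nn as nn

latent_dim = 8

generator = nn.Sequential(nn.Linear(latent_dim, 16), nn.ReLU(), nn.Linear(16, 1))
discriminator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(1000):
    real = torch.randn(64, 1) * 0.5 + 2.0          # "real" data: Gaussian centered at 2
    noise = torch.randn(64, latent_dim)
    fake = generator(noise)

    # Discriminator update: label real samples 1 and generated samples 0.
    d_opt.zero_grad()
    d_loss = bce(discriminator(real), torch.ones(64, 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(64, 1))
    d_loss.backward()
    d_opt.step()

    # Generator update: try to make the discriminator output 1 for generated samples.
    g_opt.zero_grad()
    g_loss = bce(discriminator(fake), torch.ones(64, 1))
    g_loss.backward()
    g_opt.step()

print(generator(torch.randn(5, latent_dim)).detach().squeeze())  # samples should approach ~2
```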

Variational Autoencoders (VAEs) are a class of generative models that ingeniously combine the principles of autoencoders (a type of neural network) with the techniques of variational inference, a powerful tool from the realm of Bayesian statistics. These models comprise two distinct components:

The Encoder: This neural network takes the input data (such as an image) and maps it to a lower-dimensional latent space, effectively capturing the essential features and characteristics of the data in a compressed representation.

The Decoder: This complementary neural network takes the latent representation produced by the encoder and generates a reconstruction of the original input data, effectively reversing the encoding process.

Image from: Wikipedia

The key innovation of VAEs lies in the way they encourage the latent space to follow a specific probability distribution, typically a Gaussian distribution. This remarkable property enables the generation of new data by simply sampling from the latent space and passing these samples through the decoder network.

By constraining the latent space to adhere to a known distribution, VAEs unlock a realm of possibilities. Not only can they accurately reconstruct input data, but they also possess the ability to generate entirely new and diverse data samples by exploring the latent space in a principled manner.

This powerful combination of autoencoders and variational inference techniques has opened up new avenues in various domains, allowing for the generation of synthetic data that captures the essential characteristics of the original data distribution, while also enabling the exploration of novel variations and permutations within the latent space.
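The sketch below shows the core moving parts of a VAE in PyTorch: an encoder that outputs a mean and log-variance, the reparameterization trick used to sample the latent vector, and a loss that combines a reconstruction term with a KL-divergence term pulling the latent space towards a standard Gaussian. The dimensions and layer sizes are illustrative assumptions:

```python
# Minimal VAE sketch in PyTorch (illustrative sizes; no training loop or dataset included).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.to_mu = nn.Linear(hidden_dim, latent_dim)        # mean of q(z|x)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)    # log-variance of q(z|x)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
        return self.decoder(z), mu, logvar

def vae_loss(x, x_recon, mu, logvar):
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")   # reconstruction term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL(q(z|x) || N(0, I))
    return recon + kl

# Generating new data: sample the latent space and decode (the encoder is not needed).
model = VAE()
samples = model.decoder(torch.randn(4, 16))
print(samples.shape)  # torch.Size([4, 784])
```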

In Introduction to Generative AI - Part III, we will look at the different techniques for optimizing model outputs.


