Preliminary Machine Learning Concepts
Vinay Ananth R.
Learn about neural network architecture, its types, and the key concepts of transformers. Get an understanding of how these concepts apply in GenAI systems.
We'll cover the following
- Neural network architecture
- Convolutional neural networks
- Recurrent neural network
- Transformer network
  - Positional encoding
- Takeaway
Mastering the core principles of neural networks and their variants is crucial for designing large-scale GenAI systems capable of tasks like text, image, speech, and video generation. In this lesson, we explore the foundational concepts listed above.
These machine learning concepts are the backbone of modern GenAI systems: they allow machines to learn patterns, generate creative outputs, and scale efficiently. By understanding them, we can better design and optimize the complex system designs required for real-world GenAI applications.
Let’s describe each of the above concepts, starting with neural networks.
Neural network architecture
Neural networks are computational models inspired by the human brain. They are designed to recognize patterns and make predictions by processing data through interconnected layers (discussed below) of nodes (also known as neurons). Neural network architecture refers to the structure and organization of a neural network, including the arrangement of its layers, nodes (neurons), and connections. It defines how data flows through the network and how the network learns and makes predictions or decisions.
Let’s discuss the essential components of a neural network.
Components of a neural network
Here are the key components of neural network architecture, though we will focus on only a few in this discussion:
$$\text{output} = \sigma\Big(\text{Bias} + \sum_{i=1}^{m} x_i w_i\Big) = \sigma(\text{Bias} + x_1 w_1 + x_2 w_2 + \dots + x_m w_m)$$

Where:

- $x_1, x_2, \dots, x_m$ are the inputs to the neuron
- $w_1, w_2, \dots, w_m$ are the weights applied to each input
- $\text{Bias}$ is a learnable offset added to the weighted sum
- $\sigma$ is the activation function (for example, the sigmoid function)
- $m$ is the number of inputs
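As a concrete illustration, here is a minimal NumPy sketch of the formula above for a single neuron with a sigmoid activation; the input values, weights, and bias are made up for the example.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, bias):
    """Weighted sum of inputs plus bias, passed through the activation."""
    return sigmoid(bias + np.dot(x, w))

# Illustrative values only -- a trained network learns w and bias from data.
x = np.array([0.5, -1.2, 3.0])   # inputs x_1 .. x_m
w = np.array([0.4, 0.1, -0.7])   # weights w_1 .. w_m
out = neuron_output(x, w, bias=0.2)
```

Because the sigmoid bounds the output, `out` always lies between 0 and 1 regardless of the raw weighted sum.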
The architecture of a simple neural network is provided below:
Neural networks are the foundation of all modern, complex AI systems, and every advanced architecture, from deep learning to transformers, evolves from these fundamental concepts. Their ability to learn, adapt, and generate patterns powers today’s most cutting-edge GenAI technologies.
Let’s discuss an advanced neural network architecture called a convolutional neural network, which is tailored for processing structured or image data.
Convolutional neural networks
Convolutional neural networks (CNNs) are specialized neural networks designed for processing structured data, particularly images. They are used in tasks like video analysis, speech recognition, and natural language processing, which we will discuss in further lessons. Their architecture is inspired by the visual processing system of the human brain, enabling them to extract features from raw data automatically.
Key components of CNNs
A typical CNN is built from the following layers:

- Convolutional layers: apply learnable filters to extract local features such as edges and textures
- Pooling layers: downsample the feature maps, keeping the strongest responses and reducing computation
- Fully connected layers: combine the extracted features to produce the final prediction

CNNs mimic how the human visual system processes images, where the brain identifies edges and textures first before understanding the whole picture. This biologically inspired design is why CNNs can generate photorealistic images or create entirely new artistic styles, bridging the gap between human creativity and GenAI.
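To make the feature-extraction idea concrete, here is a minimal NumPy sketch of one convolution: a hypothetical 2×2 vertical-edge kernel slid over a tiny 4×4 image. Real CNNs learn their kernels during training rather than using hand-picked values like these.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid convolution, stride 1: slide the kernel and sum elementwise products."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Tiny image: dark on the left, bright on the right (a vertical edge).
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)

# Hand-picked kernel that responds to left-right intensity changes.
kernel = np.array([[1, -1],
                   [1, -1]], dtype=float)

feature_map = conv2d(image, kernel)
```

The feature map is strongly non-zero only where the kernel straddles the edge, which is exactly the "detect local features first" behavior described above.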
Recurrent neural network
A recurrent neural network (RNN) is a class of artificial neural networks that contain loops within their hidden layers, allowing information to persist within the network over time. RNNs are designed to process sequential data by maintaining a memory of previous inputs, making them particularly effective for tasks involving temporal or contextual dependencies.
RNNs are widely used in applications such as natural language processing (e.g., text sequencing in conversational AI), generating descriptive text in text-to-image models, and handling time series data in various predictive systems.
Unlike traditional neural networks, which utilize a simple feedforward flow, recurrent neural networks contain a looping mechanism that computes an internal state update with each time step. This allows an RNN to retain information about preceding elements in a data series.
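The looping mechanism can be sketched in a few lines of NumPy: at each time step, the new hidden state mixes the current input with the previous state. The weight matrices below are random stand-ins for trained parameters.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Process a sequence step by step, carrying the hidden state forward."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x in inputs:
        # The "loop": the previous state h feeds back into the new state.
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        states.append(h)
    return states

rng = np.random.default_rng(0)
seq = [rng.standard_normal(3) for _ in range(5)]   # 5 time steps, 3 features each
W_xh = rng.standard_normal((4, 3)) * 0.1           # input-to-hidden weights
W_hh = rng.standard_normal((4, 4)) * 0.1           # hidden-to-hidden (recurrent) weights
b_h = np.zeros(4)

states = rnn_forward(seq, W_xh, W_hh, b_h)
```

Each entry of `states` depends on every input seen so far, which is how the RNN retains information about preceding elements in the series.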
Point to Ponder
Question
How do RNN loops enable sequential data handling, and why are they widely used in language models, speech processing, and time series analysis?
While RNNs are effective for sequential data, their limitations in handling long-term dependencies and parallel processing have led to the development of more advanced architectures, such as transformer networks, which revolutionize sequence modeling with attention mechanisms. Let’s discuss the transformer network in the following section.
Transformer network
Transformers are deep learning models that handle sequential data, such as text. They use a self-attention mechanism to capture the relationships between words in a sequence and are the backbone of many NLP models. The transformer network (model) was introduced in the 2017 paper "Attention Is All You Need." The following figure demonstrates its architecture.
The transformer model consists of the following main steps:
Tokenization and input encoding
This step converts each word, called a token, into a vector of fixed length—for instance, 512. Like text-to-text models, it dissects input words into smaller units through tokenization. Each token is then translated into initial embeddings, providing a numerical representation for every input fragment. This step is crucial for enabling the model to work with the intricacies of language.
Consider the sentence, “Mysterious footsteps echoed in the silent forest,” which has a dimension of 7 (words). In the given sentence, each word is considered a token. Also, each word is converted to a fixed-length vector, i.e., 512.
All the numbers in this lesson are randomly generated for illustration purposes. However, these numbers can be generated using predefined encoders for actual model training.
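As an illustration of this step, the sketch below maps each word of the example sentence to a 512-dimensional vector. The embedding table here is random, standing in for the trained tokenizer and embedding table a real model would use.

```python
import numpy as np

d_model = 512
sentence = "Mysterious footsteps echoed in the silent forest"
tokens = sentence.lower().split()   # naive whole-word tokenization for illustration

# Hypothetical embedding table: one random 512-dim vector per vocabulary entry.
rng = np.random.default_rng(42)
vocab = {tok: i for i, tok in enumerate(tokens)}
embedding_table = rng.standard_normal((len(vocab), d_model))

# Look up each token's embedding; result is a 7 x 512 matrix.
embeddings = np.stack([embedding_table[vocab[t]] for t in tokens])
```

The 7 × 512 matrix of embeddings is what the positional-encoding step, described next, operates on.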
Positional encoding
Positional encoding recognizes the significance of word order and captures the position of each word in a sentence. Without positional encoding, the GenAI system might consider different permutations of the same words as equivalent, leading to potential confusion. For example, “The sun sets behind the mountain” and “The mountain sets behind the sun” would have the same representation without positional encoding.
Positional encoding ensures that GenAI systems comprehend both the semantics of words and their positions within the input sequence, preserving the temporal nuances of language.
The positional encoding of each word is also a vector of size 512, which is added to the corresponding embedding vectors of each token, as illustrated below:
After adding the embedding and position encoding vectors, the result is provided as input to the attention module.
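One common choice for generating these vectors is the sinusoidal scheme from "Attention Is All You Need," sketched below: each position gets a 512-dimensional vector built from sines and cosines of different frequencies, which is then added to the token embeddings.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: sin on even dims, cos on odd dims."""
    pos = np.arange(seq_len)[:, None]          # positions 0 .. seq_len-1
    i = np.arange(d_model)[None, :]            # dimension indices 0 .. d_model-1
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe

# One 512-dim vector per word of the 7-word example sentence;
# these are added elementwise to the 7 x 512 embedding matrix.
pe = positional_encoding(7, 512)
```

Because each position produces a distinct vector, the two "sun/mountain" permutations above would no longer share a representation.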
The attention mechanism
The attention mechanism in the transformer model captures long-range dependencies and generates a context-aware representation for each token in the sequence based on its relationships with other tokens. It emphasizes the importance of each token to the others.
For example, consider the following two sentences:

1. “She poured milk from the jug into the glass until it was full.”
2. “She poured milk from the jug into the glass until it was empty.”

We can easily understand that “it” refers to the glass in the first sentence and the jug in the second. However, machine learning models identify this relationship between words using the attention mechanism.
The transformer model uses a multi-head attention mechanism. However, to understand it, we first need an in-depth understanding of the self-attention mechanism.
Self-attention
The self-attention mechanism computes the importance of different words in a single sequence with each other.
We assume our previous example, where $d_{sequence} = 7$ and $d_{model} = 512$. Self-attention is computed using the following formulation:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_{model}}}\right) \times V$$

The result of self-attention is a $d_{sequence} \times d_{model}$ matrix that represents how much attention each position in a sequence gives to other positions.
Terminology alert:
The attention mechanism operates on queries consolidated into a matrix $Q$. The keys and values are also grouped into matrices $K$ and $V$, respectively. The dimension of each of these matrices is $d_{sequence} \times d_{model}$, where $d_{sequence} = 7$ and $d_{model} = 512$ for the input sentence “Mysterious footsteps echoed in the silent forest.” To understand how these matrices are initially created, refer to this Educative answer about the intuition behind the dot product attention.
The softmax function generates similarity scores of each word with other words within the range of 0 to 1 (probability values), as depicted below:
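The formulation above can be written directly in NumPy. In this sketch, Q, K, and V are random stand-ins for the projected input matrices of the 7-token example sentence.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax: rows become probability distributions."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V, d_model):
    # Similarity of every token with every other token, scaled for stability.
    scores = Q @ K.T / np.sqrt(d_model)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(1)
d_seq, d_model = 7, 512
Q = rng.standard_normal((d_seq, d_model))
K = rng.standard_normal((d_seq, d_model))
V = rng.standard_normal((d_seq, d_model))

out, weights = self_attention(Q, K, V, d_model)
```

`weights` is the 7 × 7 matrix of similarity scores in the 0–1 range described above, and `out` is the resulting $d_{sequence} \times d_{model}$ context-aware representation.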
Multi-head attention
Multi-head attention?enables the model to capture different aspects or patterns in the relationships between words, enhancing its ability to learn diverse and complex dependencies. It extends self-attention by running it in parallel multiple times.
The inputs ($Q$, $K$, and $V$) are linearly transformed into multiple subsets. Each input is processed independently through several self-attention blocks called heads. For example, if we consider eight heads ($h$), the input dimension to each head would be $\frac{d_{model}}{h} = \frac{512}{8} = 64$. Let’s denote this value by $d_k$.
Let’s understand the working of multi-head attention in different steps:
$$\text{Head}_i(Q^R, K^R, V^R) = \text{softmax}\left(\frac{Q_i^R (K_i^R)^T}{\sqrt{d_k}}\right) \times V_i^R$$

Here, $i$ represents a subset of each matrix bearing the dimension $7 \times 64$.
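A minimal sketch of the per-head computation follows. For simplicity, it slices Q, K, and V into eight 64-column chunks; a real transformer would instead apply learned linear projections per head before the attention and a final projection after concatenation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, h=8):
    d_seq, d_model = Q.shape
    d_k = d_model // h                        # 512 / 8 = 64 columns per head
    heads = []
    for i in range(h):
        s = slice(i * d_k, (i + 1) * d_k)     # each head sees its own 7 x 64 slice
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_k)
        heads.append(softmax(scores) @ V[:, s])
    # Concatenating the eight 7 x 64 head outputs restores the 7 x 512 shape.
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(2)
Q = rng.standard_normal((7, 512))
K = rng.standard_normal((7, 512))
V = rng.standard_normal((7, 512))

out = multi_head_attention(Q, K, V)
```

Because each head attends over a different subspace, the heads can specialize in different relationships, as the two-head example below illustrates.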
The process is illustrated below:
The purpose of the multi-head attention mechanism in conversational and other models (as discussed in this course) is to enhance the model’s capacity to capture diverse patterns, relationships, and context within the input sequence. Instead of depending on a single attention mechanism, multi-head attention enables the model to focus on various parts of the input sequence by utilizing multiple sets of attention weights, each focusing on different aspects.
Let’s suppose we’re using two-head attention for the following sentence:
“She poured milk from the jug into the glass until it was empty.”
We might expect the following visualization of the output. For the query word “it,” the first head (colored blue) focuses on the words “the jug,” while the second head (colored brown) focuses on the words “was empty.” Therefore, the ultimate context representation will center around the words “the,” “jug,” and “empty,” making it a more advanced representation than the conventional approach.
The attention mechanism is like a guiding light that helps models like ChatGPT understand and respond coherently in conversations. This technology turns the complexities of language into something smart algorithms can handle.
Cross-attention
Cross-attention is a mechanism that allows one set of data (query) to focus on and relate to another set of data (key-value pair). It’s like highlighting the parts of one conversation most relevant to the other, ensuring the two sides make sense together.
Here, the query typically comes from one sequence (for example, the decoder), while the keys and values come from another (for example, the encoder).
Self-attention operates within a single sequence, helping each token understand its relationship with the others. In comparison, cross-attention connects two sequences, namely, the query and the key-value pairs. For example, consider translating the sentence “She poured milk into the glass” into French as “Elle a versé du lait dans le verre.”
First, the encoder processes the source sentence and generates the embeddings that capture the contextual meaning of each word. As the decoder generates the target sentence, it uses cross-attention to focus on relevant parts of the source sentence at each step. For instance, when generating “Elle,” the attention focuses on “She” in the source, identifying the subject. Similarly, for “a versé,” the attention shifts to “poured,” ensuring the correct verb conjugation is used in French. When producing “du lait,” the model focuses on “milk,” mapping the object accurately. Finally, “dans le verre” aligns with “into the glass,” translating the prepositional phrase fluently.
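The mechanism itself is the same attention computation with a different source for the query. The sketch below uses random stand-in matrices: queries come from a hypothetical decoder (target) sequence, while keys and values come from the encoder (source) sequence.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Q_dec, K_enc, V_enc):
    """Queries from the target (decoder); keys/values from the source (encoder)."""
    d_k = Q_dec.shape[-1]
    weights = softmax(Q_dec @ K_enc.T / np.sqrt(d_k))
    return weights @ V_enc

rng = np.random.default_rng(3)
src_len, tgt_len, d = 6, 8, 64    # e.g., 6 source tokens, 8 target tokens
K_enc = rng.standard_normal((src_len, d))
V_enc = rng.standard_normal((src_len, d))
Q_dec = rng.standard_normal((tgt_len, d))

out = cross_attention(Q_dec, K_enc, V_enc)
```

Each of the 8 target positions ends up with a context vector built from the 6 source positions, mirroring how “Elle” attends to “She” in the translation example.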
The feedforward network: In conversational and other models discussed in this course, the feedforward network refines the information coming from the attention mechanism. It processes each input position independently with linear transformations, followed by ReLU activation and layer normalization for stability. This helps the model capture complex relationships and adapt to diverse patterns, enhancing contextual understanding and response generation and improving the relevance and coherence of outputs, whether in natural language conversations, image creation, or video generation.
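A minimal sketch of the position-wise feedforward sublayer (linear, ReLU, linear). Layer normalization and the residual connection are omitted here, the inner width of 2048 follows the original transformer paper, and the weights are random stand-ins for trained parameters.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Applied to each token (row) independently: linear -> ReLU -> linear."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(4)
d_model, d_ff = 512, 2048            # 2048 is the inner width from the original paper
x = rng.standard_normal((7, d_model))  # attention output for the 7-token sentence

W1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

out = feed_forward(x, W1, b1, W2, b2)
```

Because the same weights are applied to every row, each token is refined independently while keeping the 7 × 512 shape expected by the next layer.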
Takeaway
Neural network architecture forms the foundation of modern AI systems, enabling them to process complex data and make intelligent predictions. Transformers are pivotal in the system design of conversational, text-to-video, and text-to-speech AI models, driving advancements in natural language understanding, multimedia processing, and response generation.