Decoding the Transformers: A Dive into GPT with TensorFlow
Krishna Chaitanya Kosaraju, Ph. D.
Mathematician and Machine Learning Expert | AI/ML Engineer | Data Scientist | NLP | LLM
The entire code for this guide, as well as some additional material, is available on the GitHub repository.
Transformers are a type of deep learning model that was introduced in the paper "Attention is All You Need" by Vaswani et al. (2017). They have been particularly successful in natural language processing tasks, and are behind state-of-the-art models such as GPT (Generative Pretrained Transformer), BERT (Bidirectional Encoder Representations from Transformers), and others.
Transformers are based on the concept of self-attention, also known as transformer attention or scaled dot-product attention. Unlike traditional recurrent neural networks (RNNs) that process input sequences one element at a time, transformers process all elements of the input sequence in parallel, which allows them to learn long-range dependencies more effectively.
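To make that concrete, here is a minimal sketch (our own illustration, not from the original article) of scaled dot-product attention for a single head, using toy random tensors:

import tensorflow as tf

# Toy single-head scaled dot-product attention; q, k, v are random placeholders
# of shape (batch, seq_len, head_size) = (1, 4, 8).
q = tf.random.normal((1, 4, 8))
k = tf.random.normal((1, 4, 8))
v = tf.random.normal((1, 4, 8))

scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(8.0)  # (1, 4, 4): query-key affinities
weights = tf.nn.softmax(scores, axis=-1)                   # each row sums to 1
out = tf.matmul(weights, v)                                # (1, 4, 8)
print(out.shape)

Each output row is a mixture of the value vectors, weighted by how strongly that position's query matches each key. This is exactly the computation we will build up, with masking and multiple heads, later in this article.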
To achieve our goal, we're drawing inspiration from a comprehensive video series by Andrej Karpathy, where he skillfully constructs a GPT model from the ground up using PyTorch. Our endeavor charts a slightly different course, however: we've opted to use TensorFlow, a robust open-source framework extensively used for machine learning applications, instead of PyTorch. More precisely, we will be adapting his nanoGPT repository, which is built in PyTorch.
Nomenclature
Before we delve deeper, it's essential to familiarize ourselves with certain terminology that will be consistently used throughout this article. Understanding these terms will not only simplify the process but also enhance comprehension.
We will capture all of this inside the GPTConfig class.
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 25    # context length: number of tokens per training window
    vocab_size: int = 200   # GPT-2 uses a vocab_size of 50257, padded up to the nearest multiple of 64 for efficiency
    n_layer: int = 12       # number of sequential transformer blocks
    n_head: int = 12        # number of attention heads
    n_embd: int = 768       # embedding size of the input
    dropout: float = 0.2    # dropout rate
    bias: bool = True       # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster
    epsilon: float = 1e-5   # epsilon value for the layer normalizations
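One detail worth noting (a small sanity check of our own, not part of the original configuration): n_embd must be divisible by n_head, since each attention head works on an equal slice of the embedding:

config = GPTConfig()
assert config.n_embd % config.n_head == 0, "n_embd must split evenly across heads"
print(config.n_embd // config.n_head)  # per-head size: 768 // 12 = 64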
Dataset
Our next objective is to procure the dataset we will be using for this project. For this purpose, we're following Andrej's footsteps and using the 'Tiny Shakespeare' dataset. We will download this dataset, encode it into numerical form, and then convert it into a batched dataset leveraging TensorFlow's in-built functionality. Encoding our data into a numerical format is crucial as it facilitates the computer's understanding and processing of the information. Here's the detailed breakdown of the provided code.
Function to download the dataset
import requests

# Function to download the dataset
def text_extractor(url="https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"):
    # Request to fetch the tiny shakespeare dataset
    response = requests.get(url)
    # Checking if we got a valid response
    if response.status_code == 200:
        # Opening a file and writing the content of the response
        with open('input.txt', 'w') as file:
            file.write(response.text)
    else:
        print(f"Failed to get file with status code: {response.status_code}")
    # Reading the downloaded file
    with open('input.txt', 'r', encoding='utf-8') as f:
        text = f.read()
    return text
Function to encode the text into numbers
# Function to encode the text into numbers
def text_encoder(text):
    # Listing and sorting the unique characters in the text
    chars = sorted(list(set(text)))
    # Getting the total number of unique characters
    vocab_size = len(chars)
    print("".join(chars))
    print(vocab_size)
    # Mapping from characters to their corresponding numerical representations
    stoi = {ch: i for i, ch in enumerate(chars)}
    # Mapping from numbers back to their corresponding characters
    itos = {i: ch for i, ch in enumerate(chars)}
    # Function to encode a string into a list of numbers
    encode = lambda s: [stoi[ch] for ch in s]
    # Function to decode a list of numbers back into a string
    decode = lambda l: "".join([itos[i] for i in l])
    print(encode("hii I am Krishna"))
    print("decoded: ", decode(encode("hii I am Krishna")))
    # Encoding the entire text into numbers
    series = encode(text)
    return series
Function to create a windowed dataset
import tensorflow as tf

# Function to create a windowed dataset
def windowed_dataset(series, window_size, batch_size, shuffle_buffer):
    # Creating a tensorflow dataset from the encoded series
    dataset = tf.data.Dataset.from_tensor_slices(series)
    # Creating windows of size window_size + 1, shifting the window by 1 each step
    dataset = dataset.window(size=window_size + 1, shift=1, drop_remainder=True)
    # Flattening the dataset of windows into a dataset of tensors
    dataset = dataset.flat_map(lambda window: window.batch(window_size + 1))
    # Splitting each window into inputs (all but the last element) and targets (all but the first, i.e., the inputs shifted one position ahead)
    dataset = dataset.map(lambda x: (x[:-1], x[1:]))
    # Shuffling the dataset
    dataset = dataset.shuffle(shuffle_buffer)
    # Batching the dataset and prefetching 1 batch at a time to improve performance
    dataset = dataset.batch(batch_size).prefetch(1)
    return dataset
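To see what this function produces, here is a quick illustrative check of our own on a tiny toy series (with shuffle_buffer=1 the shuffle is effectively disabled, so the output is deterministic):

sample = windowed_dataset(list(range(10)), window_size=4, batch_size=2, shuffle_buffer=1)
for x, y in sample.take(1):
    print(x.numpy())  # [[0 1 2 3] [1 2 3 4]]
    print(y.numpy())  # [[1 2 3 4] [2 3 4 5]]

The targets are simply the inputs shifted one position to the right, which is exactly the next-character prediction setup a GPT-style model trains on.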
Remember, the process of converting the text into numerical form is crucial because machine learning models do not understand text data in its raw form. Therefore, encoding the text data into numbers allows our model to comprehend and learn from the text data. Following the data extraction and encoding process, we can proceed to generate the training and testing data. Here's how to do it:
config = GPTConfig()
text = text_extractor()
series = text_encoder(text)
# Use the first 80% of the series for training and the rest for testing
n = int(0.8 * len(series))
# Create the training dataset
train_dataset = windowed_dataset(series[:n], config.block_size, batch_size=250, shuffle_buffer=10)
# Create the testing dataset
test_dataset = windowed_dataset(series[n:], config.block_size, batch_size=250, shuffle_buffer=10)
In this script, we start by initializing our GPT configuration object. We then extract the text data from the chosen dataset and encode it into numerical form. After getting the encoded data (series), we compute the split point n at 80% of its length. Next, we create the training dataset from the first n elements of the series, using the block size defined in our GPT configuration, a batch size of 250, and a shuffle buffer size of 10. The shuffle buffer size determines the randomness of the shuffling process, which helps improve model generalization. The test dataset is created in the same fashion, but from the elements after position n. By running this script, we will have our training and testing datasets ready for the model training phase.
Attention Mechanism
The attention mechanism forms the crux of the GPT model and many modern Transformer-based architectures. The goal of attention is to compute a context-aware representation of each word in our input sequence. Let's break down the steps involved in this process for better understanding:
1. Project each input embedding into three vectors: a query, a key, and a value.
2. Score every pair of positions by taking the dot product of the query with each key, scaled by the square root of the head size.
3. Apply a causal mask so each token can only attend to itself and earlier tokens, never the future.
4. Apply a softmax to turn the scores into attention weights.
5. Take the weighted sum of the value vectors to produce the new representation of each token.
These steps are carried out for each word in the input sequence, and the process is repeated for each layer in the Transformer model. Through these layers, the model can learn to focus on different words at each layer, capturing more complex relationships and building a rich understanding of the input.
Code Walkthrough: The MultiHeadAttention class is an integral part of the GPT model architecture. It implements the attention mechanism we explained earlier, but with a twist: it creates multiple attention networks in parallel. This allows the model to capture different types of information from the same input data.
from tensorflow.keras import layers

class MultiHeadAttention(layers.Layer):
    def __init__(self, config):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = config.n_head
        self.head_size = config.n_embd // config.n_head
        # Projecting input into key, query, and value for all attention heads, in a single batched layer
        self.c_attn = layers.Dense(3 * config.n_embd, use_bias=config.bias)
        # Regularization
        self.attn_dropout = layers.Dropout(config.dropout)
        self.resid_dropout = layers.Dropout(config.dropout)

    def call(self, x):
        B, T, C = x.shape
        # Linear transformation for queries, keys, and values; note that C = n_embd
        qkv = self.c_attn(x)  # Input shape: (B, T, C), Output shape: (B, T, 3 * n_embd)
        # Split the queries, keys, and values
        q, k, v = tf.split(qkv, 3, axis=-1)  # Output shapes: 3 * (B, T, n_embd)

        # Reshape queries, keys, and values for multi-head attention with head_size = n_embd // num_heads
        # Note: we use -1 for the batch dimension because B can be unknown (None) while
        # TensorFlow builds the graph; tf.reshape(q, (B, T, ...)) would raise an error in that case
        q = tf.reshape(q, (-1, T, self.num_heads, self.head_size))  # Output shape: (B, T, num_heads, head_size)
        k = tf.reshape(k, (-1, T, self.num_heads, self.head_size))  # Output shape: (B, T, num_heads, head_size)
        v = tf.reshape(v, (-1, T, self.num_heads, self.head_size))  # Output shape: (B, T, num_heads, head_size)
        # Transpose queries, keys, and values for efficient matrix multiplication
        q = tf.transpose(q, perm=[0, 2, 1, 3])  # Output shape: (B, num_heads, T, head_size)
        k = tf.transpose(k, perm=[0, 2, 1, 3])  # Output shape: (B, num_heads, T, head_size)
        v = tf.transpose(v, perm=[0, 2, 1, 3])  # Output shape: (B, num_heads, T, head_size)
        # Compute attention scores ("affinities")
        wei = tf.matmul(q, k, transpose_b=True) * (self.head_size ** -0.5)  # Output shape: (B, num_heads, T, T)
        mask = tf.linalg.band_part(tf.ones_like(wei), -1, 0)  # Lower triangular matrix of ones
        wei = tf.where(mask == 1, wei, float("-inf"))  # Set the upper triangular part to -inf (causal mask)
        wei = tf.nn.softmax(wei, axis=-1)  # Output shape: (B, num_heads, T, T)
        wei = self.attn_dropout(wei)  # Regularization step 1
        # Perform the weighted aggregation of the values
        out = tf.matmul(wei, v)  # Output shape: (B, num_heads, T, head_size)
        # Transpose and reshape the output to match the original shape
        out = tf.transpose(out, perm=[0, 2, 1, 3])  # Output shape: (B, T, num_heads, head_size)
        out = tf.reshape(out, (-1, T, C))  # Output shape: (B, T, C); note that C = num_heads * head_size = n_embd
        out = self.resid_dropout(out)  # Regularization step 2
        return out
Here's how the code works:
1. A single Dense layer (c_attn) projects the input into queries, keys, and values for all heads at once, which are then split apart into q, k, and v.
2. Each of q, k, and v is reshaped and transposed so that the heads get their own dimension, letting every head attend independently.
3. Attention scores are computed as scaled dot products between queries and keys, a causal mask hides future positions, and a softmax turns the scores into weights.
4. The weights aggregate the values, and the result is transposed and reshaped back to the original (B, T, C) shape, with dropout applied for regularization.
By running this process in parallel for multiple attention heads, the model can focus on different features in the input data simultaneously, allowing it to better understand complex patterns in the data.
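If the masking step feels opaque, this small standalone sketch (our own illustration) shows what the causal mask does to a 3x3 score matrix:

wei = tf.ones((3, 3))
mask = tf.linalg.band_part(tf.ones_like(wei), -1, 0)  # keep only the lower triangle
masked = tf.where(mask == 1, wei, float("-inf"))      # hide future positions
print(tf.nn.softmax(masked, axis=-1).numpy())
# Row 0 attends only to position 0; row 2 spreads attention over positions 0-2.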
Final Missing Pieces: MLP and Block
The MLP (Multi-Layer Perceptron) and Block classes in the code are vital components of the GPT model architecture. They help the model to learn complex representations and capture dependencies in the data. Let's take a closer look at both of them:
MLP Class:
class MLP(layers.Layer):
    def __init__(self, config):
        super(MLP, self).__init__()
        n_embd = config.n_embd
        # Expand to 4 * n_embd with a GELU activation, as in GPT-2
        self.c_fc = layers.Dense(4 * n_embd, use_bias=config.bias, activation=tf.keras.activations.gelu)
        # Project back down to the embedding size
        self.c_proj = layers.Dense(config.n_embd, use_bias=config.bias)
        self.dropout = layers.Dropout(config.dropout)

    def call(self, x):
        x = self.c_fc(x)
        x = self.c_proj(x)
        x = self.dropout(x)
        return x
The MLP class is essentially a simple feed-forward neural network with one hidden layer and a non-linear activation: it expands each token's representation to four times the embedding size, applies GELU, and projects back down. This is where the model learns and extracts features from the inputs.
Block Class:
class Block(layers.Layer):
    def __init__(self, config):
        super(Block, self).__init__()
        # Layer normalization keeps activations well-scaled as they flow through the network
        self.ln_1 = layers.LayerNormalization(epsilon=config.epsilon, center=False, scale=True)
        self.attn = MultiHeadAttention(config)
        self.ln_2 = layers.LayerNormalization(epsilon=config.epsilon, center=False, scale=True)
        self.mlp = MLP(config)

    def call(self, x):
        # 1. Layer normalize the input data
        x_normalized = self.ln_1(x)
        # 2. Feed it through the attention network to get the weighted values
        attn_output = self.attn(x_normalized)
        # 3. Add it back to the input (residual connection): reduces vanishing gradient issues
        x = x + attn_output
        # 4. Layer normalize the data again
        x_normalized = self.ln_2(x)
        # 5. Final pass through the multi-layer perceptron to learn features
        mlp_output = self.mlp(x_normalized)
        # 6. Add back to the input again (second residual connection)
        x = x + mlp_output
        return x
The Block class is essentially the building block of the Transformer architecture in the GPT model. It contains a multi-head self-attention mechanism followed by a position-wise feed-forward network (MLP).
These classes are the core building blocks of the GPT model, enabling it to process and learn from sequence data effectively.
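A quick, illustrative sanity check of our own: because of the residual connections, a Block maps a (batch, block_size, n_embd) tensor to a tensor of the same shape, which is exactly what allows n_layer of them to be stacked one after another:

config = GPTConfig()
block = Block(config)
x = tf.random.normal((2, config.block_size, config.n_embd))
print(block(x).shape)  # (2, 25, 768): same shape in, same shape out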
Decoder Model
This segment of code outlines the construction of our GPT model, defining the structure and connections between the multiple components that make up the model's architecture. Here, we've harnessed the flexibility and modularity of the Keras API to construct a complex model with relative ease.
def decoder(config):
    """
    Creates a decoder model based on the provided configuration.
    Args: config: An object specifying the configuration parameters.
    Returns: decoder: A Keras Model object representing the decoder model.
    """
    # Create a dict with all the layers we need
    transformer_dict = {
        # input layer
        'input': tf.keras.Input(shape=(config.block_size,)),
        # word token embeddings
        'wte': tf.keras.layers.Embedding(config.vocab_size, config.n_embd, input_length=config.block_size),
        # word position embeddings
        'wpe': tf.keras.layers.Embedding(config.block_size, config.n_embd),
        # dropout layer
        'drop': tf.keras.layers.Dropout(config.dropout),
        # Transformer blocks
        'h': tf.keras.Sequential([Block(config) for _ in range(config.n_layer)]),
        # final layer normalization
        'ln_f': tf.keras.layers.LayerNormalization(epsilon=config.epsilon, center=False, scale=True),
        # layer used to project the output of the GPT model to the vocabulary size
        'lm_head': tf.keras.layers.Dense(config.vocab_size, use_bias=False)
    }
    # input token indices
    idx = transformer_dict['input']
    pos = tf.range(0, config.block_size, dtype=tf.int64)  # shape (t,)
    # Forward the GPT model itself
    tok_emb = transformer_dict['wte'](idx)  # token embeddings of shape (b, t, n_embd)
    pos_emb = transformer_dict['wpe'](pos)  # position embeddings of shape (t, n_embd)
    x = transformer_dict['drop'](tok_emb + pos_emb)
    for block in transformer_dict['h'].layers:
        x = block(x)
    x = transformer_dict['ln_f'](x)
    # logit scores for each vocabulary word at each position in the input sequence
    logits = transformer_dict['lm_head'](x)  # shape (batch_size, sequence_length, vocab_size)
    # Create the decoder model
    model = tf.keras.Model(inputs=idx, outputs=logits, name='decoder')
    return model
Now, let's understand the purpose and function of each element of this code:
- input: a Keras Input of shape (block_size,) holding the token indices for each position in the context window.
- wte (word token embeddings): maps each token index to a dense vector of size n_embd.
- wpe (word position embeddings): maps each position (0 to block_size - 1) to a vector of the same size so the model knows token order; the two embeddings are summed.
- drop: dropout applied to the summed embeddings for regularization.
- h: the stack of n_layer Transformer Blocks that does the heavy lifting of attention and feature extraction.
- ln_f: the final layer normalization applied after the last block.
- lm_head: a Dense layer projecting each position's representation onto the vocabulary, producing the logit scores for the next token.
Once defined, we can train our GPT model on the input text data and use it to generate contextually relevant and coherent text sequences. Each component in the model serves a distinct role in processing the input data, identifying dependencies among words, and producing the output sequence.
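Generation itself isn't covered in this guide, but here is a hedged sketch of how it could look with this fixed-length model (our own illustration: start_ids is a hypothetical seed sequence of token ids, and we left-pad the context with zeros when it is shorter than block_size):

def generate(model, config, start_ids, num_new_tokens=100):
    # Illustrative sampling loop, not part of the original article
    idx = list(start_ids)
    for _ in range(num_new_tokens):
        # Crop to the last block_size tokens and left-pad with zeros if needed
        context = idx[-config.block_size:]
        context = [0] * (config.block_size - len(context)) + context
        logits = model(tf.constant([context]))  # (1, block_size, vocab_size)
        # Sample the next token from the distribution at the last position
        next_id = int(tf.random.categorical(logits[:, -1, :], num_samples=1)[0, 0])
        idx.append(next_id)
    return idx

The resulting ids could then be turned back into text with the decode mapping built in text_encoder (which would need to be returned from that function to be usable here).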
Training
if __name__ == '__main__':
    config = GPTConfig()
    text = text_extractor()
    series = text_encoder(text)
    # 80/20 train/test split
    n = int(0.8 * len(series))
    train_dataset = windowed_dataset(series[:n], config.block_size, batch_size=250, shuffle_buffer=10)
    test_dataset = windowed_dataset(series[n:], config.block_size, batch_size=250, shuffle_buffer=10)
    # Create the decoder model
    decoder_model = decoder(config)
    # Compile and train the model
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    epochs = 10
    decoder_model.compile(optimizer=optimizer, loss=loss_fn)
    history = decoder_model.fit(train_dataset, epochs=epochs, validation_data=test_dataset)
The provided code snippet demonstrates how to train the GPT model with our dataset: it rebuilds the encoded series and the train/test datasets, instantiates the decoder model, compiles it with the Adam optimizer and a sparse categorical cross-entropy loss computed from the raw logits, and finally fits it for 10 epochs while validating on the held-out data.
Please make sure to modify the batch size, epochs, learning rate, and other parameters based on your specific data size, computational capacity, and requirements.
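One illustrative way (our own suggestion, not from the original article) to sanity-check training quality: since the loss is a per-character cross-entropy, its exponential approximates per-character perplexity, which should fall steadily as the model learns:

import math

final_loss = history.history['loss'][-1]
print(f"approximate train perplexity: {math.exp(final_loss):.2f}")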
Conclusion
In conclusion, the Generative Pretrained Transformer (GPT) model, developed by OpenAI, is a potent tool for a wide array of tasks that involve generating human-like text. Through this guide, we have gone step by step through the construction of the GPT model, from the foundation of transformer blocks to multi-head attention, the MLP, layer normalization, and finally the assembling of these components into a complete model. We have also seen how to train this model using Keras and TensorFlow.
Understanding the inner workings of such a model provides insight into how it manages to generate coherent, contextually relevant text. It's a demonstration of how far we've come in natural language processing and understanding. This model finds its applications in several areas, from chatbots and virtual assistants to automated content generation and programming helpers.
The entire code for this guide, as well as some additional material, is available on the GitHub repository. I would like to extend a massive thank you to Andrej Karpathy for his effort and dedication in sharing this outstanding resource. His hard work has made it possible for many to understand and implement this powerful model.
The power of GPT lies not just in its complexity but also in the broad applications it promises. It's an exciting time to be involved in AI and machine learning, and models like GPT offer a glimpse into the future of these technologies.
#transformers #machinelearning #nlp #gpt #deeplearning Andrej Karpathy