Vector Embedding : Unsung Hero


I am writing this article to explain what I have learnt about vector embeddings and why they are one of the first stepping stones when you enter the world of LLMs, NLP, etc. Let's get right into it.

STEP 1 : Premise [What did we forget to ask?]

One of the first algorithms we learn is the linear regression model. In very simple terms, LR is used to fit a line that predicts y given x. So how does the algorithm work? We start with a linear hypothesis like y = a1*x + b1, where a1 and b1 are unknown. We define an error/loss function and iteratively minimise the error by adjusting a1 and b1.
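
If you have never seen that loop written out, here is a minimal sketch with NumPy. The toy data, learning rate and step count are made up purely for illustration, not part of the original post.

import numpy as np

# toy data: y is roughly 3*x + 2 with some noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3 * x + 2 + rng.normal(0, 1, size=100)

a1, b1 = 0.0, 0.0   # start with arbitrary guesses
lr = 0.01           # learning rate

for step in range(1000):
    error = a1 * x + b1 - y
    loss = (error ** 2).mean()          # mean squared error
    grad_a1 = 2 * (error * x).mean()    # d(loss)/d(a1)
    grad_b1 = 2 * error.mean()          # d(loss)/d(b1)
    a1 -= lr * grad_a1                  # move against the gradient
    b1 -= lr * grad_b1

print(round(a1, 2), round(b1, 2))  # should land close to 3 and 2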

Similarly, we move on to other problems: classification, house price prediction, image classification, and so on.

All of these problems we solve with machine learning share one common property: the input and output data are mostly numerical. But as soon as you enter the realm of Natural Language Processing, the dataset becomes unclear. What do we feed to the regression model, the neural network, etc.? So what did we forget to ask?


STEP 2 : The Question !!

Take any human language: English, Hindi, Bhojpuri, Nepali, Spanish. Before solving bigger problems, or sub-problems like 1. complete the sentence, 2. translate it into Spanish, 3. fill in the blank, we should ask:

How do we represent the data?

Naive approach: we can come up with a simple rule where each word in the dictionary is given a number based on its index. Now I can represent a sentence like "I am happy" with numbers like [12, 23, 97] (assume these are the indices). The value assigned to each word carries no meaning: "car" and "cat" get close numerical representations, yet "engine" is far more related to "car". No semantic meaning is captured, and the numbers are not normalised either.
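
A tiny sketch of that naive idea; the vocabulary and indices below are invented just to make the point.

# hypothetical dictionary indices, made up purely for illustration
vocab = ["am", "car", "cat", "engine", "happy", "i"]
word_to_index = {word: i for i, word in enumerate(vocab)}

sentence = "i am happy"
print([word_to_index[w] for w in sentence.split()])   # [5, 0, 4]

# "car" (1) and "cat" (2) sit next to each other only because of spelling,
# while "car" and "engine" (3) are further apart: the numbers carry no meaning
print(word_to_index["car"], word_to_index["cat"], word_to_index["engine"])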


One-hot encoding :

Another way of representing words is one-hot encoding: for each word you create a zero vector whose size equals the vocabulary size, then put a one at the position of that word's index. A sentence like "I am happy" then becomes a matrix of shape [vocab_size x words_count]. This is a small improvement over the naive approach, since a vector representation is easier to feed into algorithms, but it still doesn't capture the semantic meaning of the word.
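
With a toy vocabulary of six words it looks like the sketch below (the one_hot_encode helper further down in this post does essentially the same thing); the vocabulary here is made up for illustration.

import numpy as np

vocab = ["am", "car", "cat", "engine", "happy", "i"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, vocab_size):
    vec = np.zeros(vocab_size)
    vec[word_to_index[word]] = 1
    return vec

# "I am happy" -> a [vocab_size x words_count] matrix, one column per word
sentence = "i am happy"
matrix = np.stack([one_hot(w, len(vocab)) for w in sentence.split()]).T
print(matrix.shape)  # (6, 3)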


STEP 3 : What to do ?

Boo 👻 !!! If you have read this far, keep reading; by the end of this post you will have huge respect for vector embeddings.

We haven't asked the actual question yet: what do we mean by semantic meaning? If I can somehow quantify that, it should solve the problem. Can we define the semantic meaning of a word? That question could be treated in a very philosophical sense, but in our context we can use a kind of hack: instead of learning what a word (a group of symbols) means, we can define when two groups of symbols have similar meaning.

Example: "I am doing good", "I am doing great". We can say "good" and "great" have similar meanings. I don't need to know what either actually means; the similarity is enough to establish that they carry similar semantic meaning. Great!
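
Once words are represented as vectors, that similarity can be scored numerically; cosine similarity is one common choice. The vectors below are invented just to show the idea, they are not real embeddings.

import numpy as np

def cosine_similarity(a, b):
    # ~1.0 means the vectors point in the same direction, ~0 means unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# made-up 3-dimensional vectors, purely for illustration
good  = np.array([0.90, 0.80, 0.10])
great = np.array([0.85, 0.75, 0.20])
car   = np.array([0.10, 0.20, 0.95])

print(cosine_similarity(good, great))   # ~0.99 -> very similar
print(cosine_similarity(good, car))     # ~0.29 -> not similar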

Let me take a step back and restate the same idea at a meta level: given any unknown language that uses symbols to represent words, and given a huge corpus of it, I can establish similarities between words. That means capturing the semantics, and if in the future I can decode one word, the similar words can be approximated to have similar meanings. Yuhoo!! You can talk to aliens now; just ask them to send their alienpedia link :D.



STEP 4 : Are we done asking questions ?

No! You saw it coming, didn't you? We will take another step back and say: any kind of data/information can be represented in vector form that captures its meaning. Once we have that, we can build complex models to solve higher-level problems like QA chatbots, translation tasks, etc. So how is it done? Vector embedding. Below is a small end-to-end example: take a text corpus, tokenise it, build (centre word, context word) training pairs, and train a tiny network whose hidden layer becomes the embedding.

Text corpus :

text = '''Machine learning is the study of computer algorithms that \
improve automatically through experience. It is seen as a \
subset of artificial intelligence. Machine learning algorithms \
build a mathematical model based on sample data, known as \
training data, in order to make predictions or decisions without \
being explicitly programmed to do so. Machine learning algorithms \
are used in a wide variety of applications, such as email filtering \
and computer vision, where it is difficult or infeasible to develop \
conventional algorithms to perform the needed tasks.'''         

Tokenisation :

import re

def tokenize(text):
    # keep alphabetic words (optionally with digits or apostrophes), lower-cased
    pattern = re.compile(r"[A-Za-z]+[\w']*|[\w']*[A-Za-z]+[\w']*")
    return pattern.findall(text.lower())

tokens = tokenize(text=text)
print(f"token type {type(tokens)}, length of tokens {len(tokens)}")
print(f"sample tokens : {tokens[0:4]}")

Lookup Table

# this function is to create lookup tables for all the tokens

def mapping(tokens):
    word_to_id = {}
    id_to_word = {}

    # sorted() so the id assignment is reproducible across runs
    for index, token in enumerate(sorted(set(tokens))):
        word_to_id[token] = index
        id_to_word[index] = token

    return word_to_id, id_to_word

# lookup tables created 
word_to_id, id_to_word = mapping(tokens=tokens)

from pprint import pprint
pprint(f" word_to_id {word_to_id}  id_to_word {id_to_word}")
pprint(f" {len(word_to_id)} = UNIQUE TOKENS = { len(set(tokens))}")        

Generating Dataset

import numpy as np
np.random.seed(42) # for reproducibility
from pprint import pprint


def one_hot_encode(id, vocab_size):
    # vector of zeros with a single 1 at the word's index
    res = [0] * vocab_size
    res[id] = 1
    return res


def concat(*iterables):
    # chain several ranges together lazily
    for iterable in iterables:
        yield from iterable


def generate_training_data(tokens, word_to_id, window):
    # build skip-gram style pairs: each centre word is paired with every
    # word that appears within `window` positions of it
    X = []
    y = []
    train = []

    n_tokens = len(tokens)

    for i in range(n_tokens):
        idx = concat(
            range(max(0, i - window), i),
            range(i, min(n_tokens, i + window + 1))
        )
        for j in idx:
            if i == j:
                continue
            X.append(one_hot_encode(word_to_id[tokens[i]], len(word_to_id)))
            train.append([tokens[i], tokens[j]])
            y.append(one_hot_encode(word_to_id[tokens[j]], len(word_to_id)))

    pprint(train)  # shows the (centre, context) word pairs
    return np.asarray(X), np.asarray(y)

X, y = generate_training_data(tokens, word_to_id, 2)
print(f"X {X.shape} y {y.shape}")        

vocab_size = len(word_to_id)  # 60 unique tokens in this corpus -> 60-dimensional one-hot vectors
n_embedding = 5               # dimension of the embedding vector

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)

class MyEmbeddingModel(nn.Module):
    def __init__(self):
        super().__init__()
        # vocab_size -> n_embedding: this projection is the embedding we want to learn
        self.in_embedding = nn.Linear(vocab_size, n_embedding)
        # n_embedding -> vocab_size: scores for predicting the context word
        self.out_embedding = nn.Linear(n_embedding, vocab_size)

    def forward(self, x):
        return self.out_embedding(self.in_embedding(x))

model = MyEmbeddingModel()

# model.in_embedding(torch.tensor(X[0],dtype=torch.float32))
# model.forward(torch.tensor(X[0],dtype=torch.float32))

# Define Loss and Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training the model
epochs = 1000
losses = []

X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32)


for epoch in range(epochs):
    optimizer.zero_grad()
    outputs = model(X)
    loss = criterion(outputs, y)   # cross-entropy between predicted scores and the one-hot context word
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
    if (epoch + 1) % 100 == 0:     # log every 100 epochs instead of every epoch
        print(f'Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}')
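
Once training is done, the learned embeddings live in the weight matrix of in_embedding: it has shape (n_embedding, vocab_size), so column i is the vector for word id i. The snippet below is not from the original code, just a sketch of how you could pull the vectors out and look up the most similar words with cosine similarity, reusing word_to_id and id_to_word from above.

# rows of `embeddings` are the learned word vectors, one per vocabulary word
embeddings = model.in_embedding.weight.detach().T   # (vocab_size, n_embedding)

def most_similar(word, topn=5):
    v = embeddings[word_to_id[word]].unsqueeze(0)
    # cosine similarity between this word's vector and every other vector
    sims = torch.nn.functional.cosine_similarity(v, embeddings)
    order = torch.argsort(sims, descending=True)
    # skip position 0, which is the word itself
    return [(id_to_word[int(i)], round(float(sims[i]), 3)) for i in order[1:topn + 1]]

print(most_similar("learning"))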
    
        



