Unlocking Vietnamese-English Machine Translation with PyTorch Transformers (P1 - Data Preparation)

Image created by pixlr.com

In today's interconnected world, breaking down language barriers is essential for communication and understanding across cultures. With advancements in natural language processing (NLP), powerful tools like transformers have revolutionized machine translation. In this article, we'll delve into how to leverage PyTorch transformers to facilitate Vietnamese-English translation.

Import the required packages

import torch
import torch.nn as nn
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator        

Prepare the tokenizers

Tokenization is a pivotal first step in natural language processing (NLP): it breaks raw text down into manageable units. In the code snippet below, a tokenizer is created with the get_tokenizer method and the 'basic_english' option, which lowercases text and splits it into tokens using simple English rules.

For example, "Ăn quả nhớ kẻ trồng cây" will be tokenized and numericalized to [12, 9, 7, 6, 11, 4].

Image created by AI Vietnam
# Define tokenizer function
tokenizer = get_tokenizer('basic_english')

# Create a function to yield list of tokens
def yield_tokens(examples):
    for text in examples:
        yield tokenizer(text)

# Tokenize and numericalize the English samples: append <eos>,
# then truncate/pad to a fixed sequence_length
def vectorize_en(text, vocab, sequence_length):
    tokens = tokenizer(text)
    tokens = [vocab[token] for token in tokens] + [vocab["<eos>"]]
    token_ids = tokens[:sequence_length] + [vocab["<pad>"]] * (sequence_length - len(tokens))
    return token_ids

# Same for Vietnamese, but also prepend <sos>: the decoder needs a
# start-of-sequence marker so it can be trained to predict the next token
def vectorize_vn(text, vocab, sequence_length):
    tokens = tokenizer(text)
    tokens = [vocab["<sos>"]] + [vocab[token] for token in tokens] + [vocab["<eos>"]]
    token_ids = tokens[:sequence_length] + [vocab["<pad>"]] * (sequence_length - len(tokens))
    return token_ids
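
As a quick sanity check, we can run the tokenizer on a raw sentence (a minimal example; the exact output assumes torchtext's 'basic_english' behavior of lowercasing and splitting off punctuation):

# Quick check of the tokenizer on a raw sentence
print(tokenizer("Good morning, AI books!"))
# ['good', 'morning', ',', 'ai', 'books', '!']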


Create Vocabulary

If we input these 2 sentences:

  • "good morning", "AI books"

Then we will get the vocabulary for English:

{'morning': 6, 'good': 5, 'books': 4, 'ai': 3, '<eos>': 2, '<pad>': 1, '<unk>': 0}

corpus_en = [
    "good morning",
    "ai books"    
]
data_size_en = len(corpus_en)

# max vocabulary size and sequence length
vocab_size_en = 7
sequence_length_en = 3         

vocab_en = build_vocab_from_iterator(yield_tokens(corpus_en),
                                     max_tokens=vocab_size_en,
                                     specials=["<unk>", "<pad>", "<eos>"])
vocab_en.set_default_index(vocab_en["<unk>"])
vocab_en.get_stoi()

# Vectorize the samples
corpus_ids_en = []
for sentence in corpus_en:
    corpus_ids_en.append(vectorize_en(sentence, vocab_en, sequence_length_en))

# print
en_data = torch.tensor(corpus_ids_en, dtype=torch.long)
print(en_data)          
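
Given the vocabulary above ('good' = 5, 'morning' = 6, 'ai' = 3, 'books' = 4, '<eos>' = 2), this should print:

tensor([[5, 6, 2],
        [3, 4, 2]])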

We repeat the same steps for creating the Vietnamese corpus with 2 example sentences:

  • "chào buổi sáng", "sách ai"

Then we get the following vocabulary for Vietnamese:

{'sách': 7, 'sáng': 8, 'chào': 6, 'buổi': 5, '<sos>': 2, 'ai': 4, '<eos>': 3, '<pad>': 1, '<unk>': 0}

corpus_vn = [
    "chào bu?i sáng",
    "sách ai"    
]
data_size_vn = len(corpus_vn)

# max vocabulary size and sequence length
vocab_size_vn = 9
sequence_length_vn = 4        
# Create vocabulary
vocab_vn = build_vocab_from_iterator(yield_tokens(corpus_vn),
                                  max_tokens=vocab_size_vn,
                                  specials=["<unk>", "<pad>", "<sos>", "<eos>"])
vocab_vn.set_default_index(vocab_vn["<unk>"])
vocab_vn.get_stoi()

# Vectorize the samples; use sequence_length_vn + 1 so that, after the
# shift-right step below, both the input and the label still have
# length sequence_length_vn
corpus_ids_vn = []
for sentence in corpus_vn:
    corpus_ids_vn.append(vectorize_vn(sentence, vocab_vn, sequence_length_vn+1))

# print
print(corpus_ids_vn)        
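
Given the Vietnamese vocabulary above, this should print:

[[2, 6, 5, 8, 3], [2, 7, 4, 3, 1]]

The first sentence fills all 5 positions, while the shorter second sentence ends with a '<pad>' token (= 1).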

Create data and label

Shifting right is a fundamental operation used in sequence modelling tasks such as language modelling or sequence prediction. Its primary function is to create input-output pairs, crucial for training supervised learning models.

Consider the target sentence "Chào buổi sáng <EOS>" (where <EOS> marks the end of the sentence). Shifting it one position to the right, i.e. prepending <SOS> (the start-of-sentence marker) and dropping the last token, gives "<SOS> Chào buổi sáng". The shifted sequence serves as the input, while the original sequence acts as the target or label. This alignment ensures that the model learns to predict the next element in the sequence from the preceding elements.

In practical terms, this means that given the input sequence "<SOS> Chào buổi sáng", the model is trained to predict, at each position, the next token: "Chào", "buổi", "sáng", and finally "<EOS>", which signifies the end of the sentence. By repeating this process across many input sequences and their corresponding labels, the model learns the underlying patterns and dependencies in the data, enabling it to generate coherent predictions for new inputs.


input_vn_data = []
label_vn_data = []

for vector in corpus_ids_vn:
    input_vn_data.append(vector[:-1])   # drop the last token -> decoder input
    label_vn_data.append(vector[1:])    # drop <sos> -> prediction target

# convert to tensors
input_vn_data = torch.tensor(input_vn_data, dtype=torch.long)
label_vn_data = torch.tensor(label_vn_data, dtype=torch.long)

# print
print(input_vn_data)
print(label_vn_data)        
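
For our two example sentences, this should print:

tensor([[2, 6, 5, 8],
        [2, 7, 4, 3]])
tensor([[6, 5, 8, 3],
        [7, 4, 3, 1]])

At every position, the label holds the token that follows the corresponding input token, which is exactly what the model will learn to predict.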

In the second part, I will focus on how to design the Transformer model and train it on this data. [Coming soon!!!]

Thank you for reading the article! If you're interested in more of my writing, you can check out the following articles or subscribe to my YouTube channel here:

  • How did I become a Data Analyst? [Link] only in Vietnamese
  • What did I study to become a Data Analyst? [Link] only in Vietnamese
  • The journey of becoming a data analyst from Vietnam to Germany (Sponsor Visa) [Link] only in Vietnamese

Feel free to leave comments and interact to motivate me to write Part 2!






