Unlocking Vietnamese-English Machine Translation with PyTorch Transformers (Part 1: Data Preparation)
In today's interconnected world, breaking down language barriers is essential for communication and understanding across cultures. With advancements in natural language processing (NLP), powerful tools like transformers have revolutionized machine translation. In this article, we'll delve into how to leverage PyTorch transformers to facilitate Vietnamese-English translation.
Install and import the required packages
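The code in this article relies on torch and torchtext (for get_tokenizer and build_vocab_from_iterator). A typical setup is the single line below; note that torchtext releases are pinned to specific torch versions, so you may need to pick a matching pair for your environment.
pip install torch torchtext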
import torch
import torch.nn as nn
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
Prepare the tokenizers
Tokenization is a pivotal initial step in natural language processing (NLP), essential for breaking down text into manageable units. In the code snippet below, a tokenizer is created with get_tokenizer('basic_english'), a simple rule-based tokenizer that lowercases text and splits it on whitespace and common punctuation.
For example, with a suitable vocabulary, "ăn quả nhớ kẻ trồng cây" would be tokenized and mapped to IDs such as [12, 9, 7, 6, 11, 4].
# Define tokenizer function
tokenizer = get_tokenizer('basic_english')

# Create a function to yield lists of tokens
def yield_tokens(examples):
    for text in examples:
        yield tokenizer(text)
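As a quick sanity check (a made-up sentence, not part of our corpus), the 'basic_english' tokenizer lowercases the text and splits punctuation into separate tokens:
print(tokenizer("Good morning!"))
# ['good', 'morning', '!']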
# Tokenize and numericalize your samples
def vectorize_en(text, vocab, sequence_length):
    tokens = tokenizer(text)
    # append <eos>, then truncate/pad to a fixed length
    tokens = [vocab[token] for token in tokens] + [vocab["<eos>"]]
    token_ids = tokens[:sequence_length] + [vocab["<pad>"]] * (sequence_length - len(tokens))
    return token_ids

def vectorize_vn(text, vocab, sequence_length):
    tokens = tokenizer(text)
    # prepend <sos> and append <eos>, then truncate/pad to a fixed length
    tokens = [vocab["<sos>"]] + [vocab[token] for token in tokens] + [vocab["<eos>"]]
    token_ids = tokens[:sequence_length] + [vocab["<pad>"]] * (sequence_length - len(tokens))
    return token_ids
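Before building the real vocabularies below, here is a minimal sketch of what these helpers return, using a hypothetical toy vocabulary (a plain dict stands in for a torchtext vocab here, since both support bracket lookup):
# hypothetical toy vocabulary, for illustration only
toy_vocab = {"<unk>": 0, "<pad>": 1, "<eos>": 2, "good": 3, "morning": 4}
print(vectorize_en("good morning", toy_vocab, 4))
# [3, 4, 2, 1] -> "good", "morning", "<eos>", plus one "<pad>" to reach length 4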
Create Vocabulary
If we feed in the two example sentences below, we get the following vocabulary for English:
{'morning': 6, 'good': 5, 'books': 4, 'ai': 3, '<eos>': 2, '<pad>': 1, '<unk>': 0}
corpus_en = [
"good morning",
"ai books"
]
data_size_en = len(corpus_en)
# max vocabulary size and sequence length
vocab_size_en = 7
sequence_length_en = 3
# Create vocabulary
vocab_en = build_vocab_from_iterator(yield_tokens(corpus_en),
                                     max_tokens=vocab_size_en,
                                     specials=["<unk>", "<pad>", "<eos>"])
vocab_en.set_default_index(vocab_en["<unk>"])
vocab_en.get_stoi()
# Vectorize the samples
corpus_ids_en = []
for sentence in corpus_en:
    corpus_ids_en.append(vectorize_en(sentence, vocab_en, sequence_length_en))

# convert to a tensor and print
en_data = torch.tensor(corpus_ids_en, dtype=torch.long)
print(en_data)
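With the vocabulary shown above, both sentences map to exactly three IDs (two words plus <eos>), so no padding is applied and the printed tensor should be:
tensor([[5, 6, 2],
        [3, 4, 2]])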
We repeat the same steps to create the Vietnamese corpus from two example sentences and obtain the following vocabulary:
{'sách': 7, 'sáng': 8, 'chào': 6, 'buổi': 5, '<sos>': 2, 'ai': 4, '<eos>': 3, '<pad>': 1, '<unk>': 0}
corpus_vn = [
"chào bu?i sáng",
"sách ai"
]
data_size_vn = len(corpus_vn)
# max vocabulary size and sequence length
vocab_size_vn = 9
sequence_length_vn = 4
# Create vocabulary
vocab_vn = build_vocab_from_iterator(yield_tokens(corpus_vn),
                                     max_tokens=vocab_size_vn,
                                     specials=["<unk>", "<pad>", "<sos>", "<eos>"])
vocab_vn.set_default_index(vocab_vn["<unk>"])
vocab_vn.get_stoi()
# Vectorize the samples; use sequence_length_vn + 1 so that the
# shifted input/label pairs below each have length sequence_length_vn
corpus_ids_vn = []
for sentence in corpus_vn:
    corpus_ids_vn.append(vectorize_vn(sentence, vocab_vn, sequence_length_vn + 1))

# print
print(corpus_ids_vn)
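Given the Vietnamese vocabulary above and a length of sequence_length_vn + 1 = 5, the printed lists should be:
[[2, 6, 5, 8, 3], [2, 7, 4, 3, 1]]
The first sentence fills all five positions (<sos>, chào, buổi, sáng, <eos>), while the shorter second sentence gets a single <pad> at the end.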
Create data and label
Shifting right is a fundamental operation used in sequence modelling tasks such as language modelling or sequence prediction. Its primary function is to create input-output pairs, crucial for training supervised learning models.
Consider the vectorized sentence "<sos> chào buổi sáng <eos>". Dropping the last token gives the input "<sos> chào buổi sáng", while dropping the first token gives the shifted sequence "chào buổi sáng <eos>". The first sequence serves as the input, and the shifted sequence acts as the target or label, so each position in the input is aligned with the token that should come next.
In practical terms, this means that when the model reads the input "<sos> chào buổi sáng", it is trained to predict "chào buổi sáng <eos>" one position at a time; in particular, at the final position it must predict "<eos>", which signifies the end of the sentence. By repeating this process across many input sequences and their corresponding labels, the model learns the underlying patterns and dependencies within the data, enabling it to generate coherent predictions when given new input sequences.
input_vn_data = []
label_vn_data = []
for vector in corpus_ids_vn:
    input_vn_data.append(vector[:-1])  # drop the last token -> input
    label_vn_data.append(vector[1:])   # drop the first token -> label

# convert to tensors
input_vn_data = torch.tensor(input_vn_data, dtype=torch.long)
label_vn_data = torch.tensor(label_vn_data, dtype=torch.long)

# print
print(input_vn_data)
print(label_vn_data)
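Using the vectorized corpus above, the shift should produce:
tensor([[2, 6, 5, 8],
        [2, 7, 4, 3]])
tensor([[6, 5, 8, 3],
        [7, 4, 3, 1]])
Each label row is its input row shifted by one position, so position i of the input is trained to predict token i + 1.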
In the second part, I will focus on how we design the Transformer model and train it on this data [Coming soon!]
Thank you for reading the article! If you're interested in more of my work, you can check out my other articles or subscribe to my YouTube channel here:
Feel free to leave comments and interact to motivate me to write Part 2!