Unlocking Vietnamese-English Machine Translation with PyTorch Transformers (Part 1: Data Preparation)
In today's interconnected world, breaking down language barriers is essential for communication and understanding across cultures. With advancements in natural language processing (NLP), powerful tools like transformers have revolutionized machine translation. In this article, we'll delve into how to leverage PyTorch transformers to facilitate Vietnamese-English translation.
Install and import the required packages
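The code in this article relies on torch and torchtext (for get_tokenizer and build_vocab_from_iterator). A typical setup is the single line below; note that torchtext releases are pinned to specific torch versions, so you may need to pick a matching pair for your environment.
pip install torch torchtext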
import torch
import torch.nn as nn
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
Prepare the tokenizers
Tokenization is a pivotal initial step in natural language processing (NLP), essential for breaking down text into manageable units. In the code snippet below, a tokenizer is created with get_tokenizer('basic_english'), a simple rule-based tokenizer that lowercases text and splits it on whitespace and common punctuation.
For example, with a suitable vocabulary, "ăn quả nhớ kẻ trồng cây" would be tokenized and mapped to IDs such as [12, 9, 7, 6, 11, 4].
# Define tokenizer function
tokenizer = get_tokenizer('basic_english')

# Create a function to yield lists of tokens
def yield_tokens(examples):
    for text in examples:
        yield tokenizer(text)
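As a quick sanity check (a made-up sentence, not part of our corpus), the 'basic_english' tokenizer lowercases the text and splits punctuation into separate tokens:
print(tokenizer("Good morning!"))
# ['good', 'morning', '!']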
# Tokenize and numericalize your samples
def vectorize_en(text, vocab, sequence_length):
    tokens = tokenizer(text)
    # append <eos>, then truncate/pad to a fixed length
    tokens = [vocab[token] for token in tokens] + [vocab["<eos>"]]
    token_ids = tokens[:sequence_length] + [vocab["<pad>"]] * (sequence_length - len(tokens))
    return token_ids

def vectorize_vn(text, vocab, sequence_length):
    tokens = tokenizer(text)
    # prepend <sos> and append <eos>, then truncate/pad to a fixed length
    tokens = [vocab["<sos>"]] + [vocab[token] for token in tokens] + [vocab["<eos>"]]
    token_ids = tokens[:sequence_length] + [vocab["<pad>"]] * (sequence_length - len(tokens))
    return token_ids
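Before building the real vocabularies below, here is a minimal sketch of what these helpers return, using a hypothetical toy vocabulary (a plain dict stands in for a torchtext vocab here, since both support bracket lookup):
# hypothetical toy vocabulary, for illustration only
toy_vocab = {"<unk>": 0, "<pad>": 1, "<eos>": 2, "good": 3, "morning": 4}
print(vectorize_en("good morning", toy_vocab, 4))
# [3, 4, 2, 1] -> "good", "morning", "<eos>", plus one "<pad>" to reach length 4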
Create Vocabulary
If we feed in the two example sentences below, we get the following vocabulary for English:
{'morning': 6, 'good': 5, 'books': 4, 'ai': 3, '<eos>': 2, '<pad>': 1, '<unk>': 0}
corpus_en = [
"good morning",
"ai books"
]
data_size_en = len(corpus_en)
# max vocabulary size and sequence length
vocab_size_en = 7
sequence_length_en = 3
# Create vocabulary
vocab_en = build_vocab_from_iterator(yield_tokens(corpus_en),
                                     max_tokens=vocab_size_en,
                                     specials=["<unk>", "<pad>", "<eos>"])
vocab_en.set_default_index(vocab_en["<unk>"])
vocab_en.get_stoi()
# Vectorize the samples
corpus_ids_en = []
for sentence in corpus_en:
    corpus_ids_en.append(vectorize_en(sentence, vocab_en, sequence_length_en))

# convert to a tensor and print
en_data = torch.tensor(corpus_ids_en, dtype=torch.long)
print(en_data)
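With the vocabulary shown above, both sentences map to exactly three IDs (two words plus <eos>), so no padding is applied and the printed tensor should be:
tensor([[5, 6, 2],
        [3, 4, 2]])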
We repeat the same steps to create the Vietnamese corpus from two example sentences and obtain the following vocabulary:
{'sách': 7, 'sáng': 8, 'chào': 6, 'buổi': 5, '<sos>': 2, 'ai': 4, '<eos>': 3, '<pad>': 1, '<unk>': 0}
corpus_vn = [
"chào bu?i sáng",
"sách ai"
]
data_size_vn = len(corpus_vn)
# max vocabulary size and sequence length
vocab_size_vn = 9
sequence_length_vn = 4
# Create vocabulary
vocab_vn = build_vocab_from_iterator(yield_tokens(corpus_vn),
                                     max_tokens=vocab_size_vn,
                                     specials=["<unk>", "<pad>", "<sos>", "<eos>"])
vocab_vn.set_default_index(vocab_vn["<unk>"])
vocab_vn.get_stoi()
# Vectorize the samples; use sequence_length_vn + 1 so that the
# shifted input/label pairs below each have length sequence_length_vn
corpus_ids_vn = []
for sentence in corpus_vn:
    corpus_ids_vn.append(vectorize_vn(sentence, vocab_vn, sequence_length_vn + 1))

# print
print(corpus_ids_vn)
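Given the Vietnamese vocabulary above and a length of sequence_length_vn + 1 = 5, the printed lists should be:
[[2, 6, 5, 8, 3], [2, 7, 4, 3, 1]]
The first sentence fills all five positions (<sos>, chào, buổi, sáng, <eos>), while the shorter second sentence gets a single <pad> at the end.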
Create data and label
Shifting right is a fundamental operation used in sequence modelling tasks such as language modelling or sequence prediction. Its primary function is to create input-output pairs, crucial for training supervised learning models.
Consider the vectorized sentence "<sos> chào buổi sáng <eos>". Dropping the last token gives the input "<sos> chào buổi sáng", while dropping the first token gives the shifted sequence "chào buổi sáng <eos>". The first sequence serves as the input, and the shifted sequence acts as the target or label, so each position in the input is aligned with the token that should come next.
In practical terms, this means that when the model reads the input "<sos> chào buổi sáng", it is trained to predict "chào buổi sáng <eos>" one position at a time; in particular, at the final position it must predict "<eos>", which signifies the end of the sentence. By repeating this process across many input sequences and their corresponding labels, the model learns the underlying patterns and dependencies within the data, enabling it to generate coherent predictions when given new input sequences.
input_vn_data = []
label_vn_data = []
for vector in corpus_ids_vn:
    input_vn_data.append(vector[:-1])  # drop the last token -> input
    label_vn_data.append(vector[1:])   # drop the first token -> label

# convert to tensors
input_vn_data = torch.tensor(input_vn_data, dtype=torch.long)
label_vn_data = torch.tensor(label_vn_data, dtype=torch.long)

# print
print(input_vn_data)
print(label_vn_data)
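Using the vectorized corpus above, the shift should produce:
tensor([[2, 6, 5, 8],
        [2, 7, 4, 3]])
tensor([[6, 5, 8, 3],
        [7, 4, 3, 1]])
Each label row is its input row shifted by one position, so position i of the input is trained to predict token i + 1.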
In the second part, I will focus on how we design the Transformer model and train it on this data [Coming soon!]
Thank you for reading the article! If you're interested in more of my work, you can check out my other articles or subscribe to my YouTube channel here:
Feel free to leave comments and interact to motivate me to write Part 2!