NLP Translation using Hugging Face's Transformers and SacreMoses libraries

Natural Language Processing (NLP) has made significant progress in recent years due to advances in deep learning techniques such as Transformer models. One popular application of NLP is machine translation, where a model can translate text from one language to another. In this tutorial, we will learn how to use Hugging Face's Transformers library to perform neural machine translation between English and French using pre-trained models. We will also use SacreMoses for text preprocessing.

Notebook link: https://github.com/ArjunAranetaCodes/LangChain-Guides/blob/main/translation_using_Helsinki_NLP.ipynb

!pip install transformers accelerate        

This command installs the Transformers and Accelerate packages. The Transformers package contains implementations of various state-of-the-art NLP models, while Accelerate provides tools for distributed training.

!pip install sacremoses        

This command installs the SacreMoses package, which provides Python ports of the Moses tokenizer, detokenizer, and punctuation normalizer. The Marian tokenizer used below relies on it for text preprocessing, which is why we install it up front.

from transformers import AutoTokenizer, MarianMTModel        

This imports the necessary classes from the Transformers package - AutoTokenizer for creating a tokenizer object that converts text into numerical tokens, and MarianMTModel for loading a pre-trained machine translation model.

src = "en" 
trg = "fr"        

These variables define the source and target languages for our machine translation task. Here, we are translating from English ("en") to French ("fr").

model_name = f"Helsinki-NLP/opus-mt-{src}-{trg}"        

This variable defines the name of the pre-trained machine translation model we will be using. In this case, we are using the Helsinki-NLP implementation of the OPUS-MT model trained on data from the OPUS project.
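Because the name is built with an f-string, swapping the language codes is all it takes to select a different checkpoint, e.g. the reverse French-to-English model:

```python
src = "en"
trg = "fr"

# Resolves to the checkpoint's identifier on the Hugging Face Hub
model_name = f"Helsinki-NLP/opus-mt-{src}-{trg}"
print(model_name)  # Helsinki-NLP/opus-mt-en-fr

# Reversing the codes points at the opposite translation direction
reverse_name = f"Helsinki-NLP/opus-mt-{trg}-{src}"
print(reverse_name)  # Helsinki-NLP/opus-mt-fr-en
```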

model = MarianMTModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

These lines load the pre-trained machine translation model and its corresponding tokenizer from their saved locations. The tokenizer is used to convert input text into numerical tokens that the model can understand.

sample_text = "Vegetables share resources with nitrogen-fixing bacteria"        

This line defines the input text we want to translate.

batch = tokenizer([sample_text], return_tensors="pt")        

This line encodes the input text into numerical tokens using the loaded tokenizer. It returns a dictionary-like object holding PyTorch tensors (input_ids and attention_mask) that can be passed directly to the model.

generated_ids = model.generate(**batch)        

This line generates the translated output by passing the encoded input through the pre-trained machine translation model. It returns the IDs of the generated tokens.

tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]        

This line decodes the generated token IDs back into human-readable text using the loaded tokenizer. Setting skip_special_tokens=True drops the special markers (such as padding and end-of-sequence tokens) that the model emits, and the [0] index retrieves the translation for our single input sentence.
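Putting the steps together, the whole tutorial can be run as one short script. This is a sketch of the same workflow; downloading the checkpoint requires a network connection, and the exact French wording may vary between model revisions:

```python
from transformers import AutoTokenizer, MarianMTModel

src, trg = "en", "fr"
model_name = f"Helsinki-NLP/opus-mt-{src}-{trg}"

# Download (or load from the local cache) the checkpoint and its tokenizer
model = MarianMTModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

sample_text = "Vegetables share resources with nitrogen-fixing bacteria"

# Encode the input, generate the translation, and decode it back to text
batch = tokenizer([sample_text], return_tensors="pt")
generated_ids = model.generate(**batch)
translation = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(translation)
```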

That's it! With just a few lines of code, we have performed machine translation using Hugging Face's Transformers library and SacreMoses. You can experiment with different source and target languages, or even train your own custom machine translation model using the same workflow.
