NLP Translation using Hugging Face's Transformers and SacreMoses libraries

Natural Language Processing (NLP) has made significant progress in recent years due to advances in deep learning techniques such as Transformer models. One popular application of NLP is machine translation, where a model can translate text from one language to another. In this tutorial, we will learn how to use Hugging Face's Transformers library to perform neural machine translation between English and French using pre-trained models. We will also use SacreMoses for text preprocessing.

Notebook link: https://github.com/ArjunAranetaCodes/LangChain-Guides/blob/main/translation_using_Helsinki_NLP.ipynb

!pip install transformers accelerate        

This command installs the Transformers and Accelerate packages. The Transformers package contains implementations of various state-of-the-art NLP models, while Accelerate provides tools for distributed training.

!pip install sacremoses        

This command installs the SacreMoses package, which provides Python ports of the Moses tokenizer, detokenizer, and punctuation normalizer. The Marian tokenizer used below relies on it for text preprocessing, which is why we install it up front.

from transformers import AutoTokenizer, MarianMTModel        

This imports the necessary classes from the Transformers package - AutoTokenizer for creating a tokenizer object that converts text into numerical tokens, and MarianMTModel for loading a pre-trained machine translation model.

src = "en" 
trg = "fr"        

These variables define the source and target languages for our machine translation task. Here, we are translating from English ("en") to French ("fr").

model_name = f"Helsinki-NLP/opus-mt-{src}-{trg}"        

This variable defines the name of the pre-trained machine translation model we will be using. In this case, we are using the Helsinki-NLP implementation of the OPUS-MT model trained on data from the OPUS project.
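Because the name is built with an f-string, swapping the language codes is all it takes to select a different checkpoint, e.g. the reverse French-to-English model:

```python
src = "en"
trg = "fr"

# Resolves to the checkpoint's identifier on the Hugging Face Hub
model_name = f"Helsinki-NLP/opus-mt-{src}-{trg}"
print(model_name)  # Helsinki-NLP/opus-mt-en-fr

# Reversing the codes points at the opposite translation direction
reverse_name = f"Helsinki-NLP/opus-mt-{trg}-{src}"
print(reverse_name)  # Helsinki-NLP/opus-mt-fr-en
```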

model = MarianMTModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

These lines load the pre-trained machine translation model and its corresponding tokenizer from their saved locations. The tokenizer is used to convert input text into numerical tokens that the model can understand.

sample_text = "Vegetables share resources with nitrogen-fixing bacteria"        

This line defines the input text we want to translate.

batch = tokenizer([sample_text], return_tensors="pt")        

This line encodes the input text into numerical tokens using the loaded tokenizer. It returns a dictionary-like object holding PyTorch tensors (input_ids and attention_mask) that can be passed directly to the model.

generated_ids = model.generate(**batch)        

This line generates the translated output by passing the encoded input through the pre-trained machine translation model. It returns the IDs of the generated tokens.

tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]        

This line decodes the generated token IDs back into human-readable text using the loaded tokenizer. Setting skip_special_tokens=True drops the special markers (such as padding and end-of-sequence tokens) that the model emits, and the [0] index retrieves the translation for our single input sentence.
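Putting the steps together, the whole tutorial can be run as one short script. This is a sketch of the same workflow; downloading the checkpoint requires a network connection, and the exact French wording may vary between model revisions:

```python
from transformers import AutoTokenizer, MarianMTModel

src, trg = "en", "fr"
model_name = f"Helsinki-NLP/opus-mt-{src}-{trg}"

# Download (or load from the local cache) the checkpoint and its tokenizer
model = MarianMTModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

sample_text = "Vegetables share resources with nitrogen-fixing bacteria"

# Encode the input, generate the translation, and decode it back to text
batch = tokenizer([sample_text], return_tensors="pt")
generated_ids = model.generate(**batch)
translation = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(translation)
```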

That's it! With just a few lines of code, we have performed machine translation using Hugging Face's Transformers library and SacreMoses. You can experiment with different source and target languages, or even train your own custom machine translation model using the same workflow.
