Update on the Neural MT Gateway

For the past few months I've been working on my "free online translation site" at https://nmtgateway.com. It's not that I've stopped working for money - no, it's just that folks can use this website free of charge to translate short sentences from and into a variety of languages, some of which can't be found elsewhere (at present). I'm still very keen on getting paid for other work I do! Hang on, I hear you saying, can't Google Translate, Microsoft Translator and a whole swarm of other sites do that, and perhaps do it better? Google and Microsoft have extensive server farms. I have two oldish servers in a corner of my tiny workroom. So, what's the point? Well, you might also ask what's the point of a young artist picking up a paintbrush when people can go to a museum and view the works of Van Gogh, Rembrandt and Picasso. But then, should the young artist be in paralyzing awe of Picasso?

Let's take a simple English sentence:

"This day we celebrate the liberation of our country."

For the Albanian translation Google gives us:

"Këtë ditë do të kremtojmë çlirimin e vendit tonë."

But then the "nmtgateway" gives us:

"Këtë ditë do të festojmë çlirimin e vendit tonë." Anyone notice any difference?

Ewe is spoken by about 3 million people in the Volta Region of south-east Ghana, as well as in southwest Togo and parts of Benin. Google Translate doesn't yet offer Ewe. The "nmtgateway" translates our sentence into Ewe as:

"Míaɖu míaƒe dukɔa ƒe ablɔɖe ƒe azã le ŋkeke sia dzi."

Back-translated, the sentence reads: "We will be celebrating our nation's freedom on this date." The Ewe sentence thus captures the meaning of the English source. Even at the level of simple sentences, this is, I hope, a useful contribution to enabling communication with speakers of Ewe!

At this point I have to come clean. Neither the Albanian<>English nor the Ewe<>English models were trained by me. They are two of the thousand or so models trained by Jörg Tiedemann and his team at the University of Helsinki for the Opus-MT project (https://github.com/Helsinki-NLP/Opus-MT). I am progressively fine-tuning these models as and when I get hold of suitable datasets. For the African languages these datasets are few and far between.
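For readers who want to try one of these pretrained models themselves, here is a minimal Python sketch using the Hugging Face transformers library. The hub names follow the Helsinki-NLP/opus-mt-&lt;src&gt;-&lt;tgt&gt; pattern; the helper names below are purely illustrative, and the model is downloaded on first use:

```python
def opus_model_name(src, tgt):
    """Build the Hugging Face hub name of an Opus-MT model,
    e.g. ('sq', 'en') -> 'Helsinki-NLP/opus-mt-sq-en'."""
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"

def translate(text, src, tgt):
    """Translate one sentence with a pretrained Opus-MT model.
    Imported lazily because it requires the transformers package
    and downloads the model weights on first use."""
    from transformers import pipeline
    translator = pipeline("translation", model=opus_model_name(src, tgt))
    return translator(text)[0]["translation_text"]

# Example (downloads the English->Albanian model):
#   translate("This day we celebrate the liberation of our country.", "en", "sq")
```

Fine-tuning then simply continues training such a checkpoint on whatever in-domain sentence pairs one manages to collect.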

A useful resource for me has been the Lanfrica website (https://lanfrica.com). This organisation states: "Lanfrica is a language-focused search engine that makes it fast and easy to find information on the Internet about resources relating to African languages. We aim to catalogue and connect all African language resources, one record at a time."

One of the treasures I have found on this website is an English-Luganda parallel corpus. Luganda is spoken by about 3 million Baganda people, who live mainly in the Buganda region in southern Uganda. It is also widely used elsewhere in Uganda as a second language.

The website states: "This English-Luganda parallel sentence corpus was created by a team of researchers from AI & Data science research Lab at Makerere University with a team of Luganda teachers, students and freelancers. The collaborative work which involves generating English sentences under CC-0 and translating these sentences using a crowdsourcing, iterative and opensource approach was done using Pontoon an opensource Translation Management System built by Mozilla. Acknowledgment: This project was supported by the AI4D language dataset fellowship through K4All and Zindi Africa". The dataset consists of 18,100 human-translated sentence pairs. An example of a translation using my interactive command line script is shown below:


Enter path of pretrained model: ./Helsinki-NLP/opus-mt-lg-en

Enter path of ct2_model: ./converted_ct2_models/opus-mt-lg-en-finetuned_ctranslate2

Enter text to be translated: Guno ddwaliro erisinga mu ggwanga.

Translation: This is the best hospital in the country.

The keen-eyed reader will notice the word "ctranslate2". I am gradually converting the Hugging Face/Opus-MT models on my website to the CTranslate2 format. CTranslate2 is a C++ and Python library for efficient inference with Transformer models (https://github.com/OpenNMT/CTranslate2), and the conversion significantly increases the speed of inference (translation). The OpenNMT team have written scripts to convert models from the following frameworks:

- OpenNMT-py

- OpenNMT-tf

- Fairseq

- Marian

- Transformers
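For the Hugging Face models, the conversion is a one-off step. A sketch, assuming the ctranslate2 package is installed (the output directory naming convention here is my own, not part of the library):

```python
def ct2_output_dir(model_name):
    """Derive a local output directory from a hub model name,
    e.g. 'Helsinki-NLP/opus-mt-lg-en' -> 'opus-mt-lg-en-ct2'."""
    return model_name.split("/")[-1] + "-ct2"

def convert_to_ct2(model_name, quantization="int8"):
    """Convert a Hugging Face Transformers model to the CTranslate2 format.
    Imported lazily: requires the ctranslate2 package and downloads the
    source model weights."""
    import ctranslate2.converters
    converter = ctranslate2.converters.TransformersConverter(model_name)
    return converter.convert(ct2_output_dir(model_name), quantization=quantization)

# The equivalent command-line tool, installed with CTranslate2:
#   ct2-transformers-converter --model Helsinki-NLP/opus-mt-lg-en \
#       --output_dir opus-mt-lg-en-ct2 --quantization int8
```

Quantizing to int8 during conversion also shrinks the model considerably, which matters when everything has to fit on two oldish servers.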

The Amharic-English model on my site was trained by me using the OpenNMT-tf toolkit and then converted into the CTranslate2 format. For some reason there seems to be no Opus-MT Amharic-English model, despite the fact that the language has some 25 million speakers in Ethiopia. My Dutch<>English and Tagalog<>English models have all been converted to the CTranslate2 format, but the older Spanish<>English and Indonesian<>English language pairs still rely on older OpenNMT-tf models. By and large, however, my course of action from now on will be to deploy Opus-MT pretrained models and continuously fine-tune them as and when I acquire useful datasets.
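Inference with a converted model then looks roughly like this. This is a sketch, assuming the ctranslate2 and transformers packages; the directory and model names are placeholders, and the original Hugging Face tokenizer is reused to tokenize the input into SentencePiece pieces:

```python
def detokenize(pieces):
    """Join SentencePiece pieces back into plain text
    ('▁' marks the start of a word)."""
    return "".join(pieces).replace("▁", " ").strip()

def ct2_translate(text, ct2_model_dir, hf_model_name):
    """Translate one sentence with a converted CTranslate2 model.
    Imports are lazy: requires the ctranslate2 and transformers packages."""
    import ctranslate2
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(hf_model_name)
    # CTranslate2 consumes tokens, not ids, so convert ids back to pieces.
    source_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
    translator = ctranslate2.Translator(ct2_model_dir)
    results = translator.translate_batch([source_tokens])
    return detokenize(results[0].hypotheses[0])

# Example (paths are illustrative):
#   ct2_translate("Guno ddwaliro erisinga mu ggwanga.",
#                 "./opus-mt-lg-en-ct2", "Helsinki-NLP/opus-mt-lg-en")
```

Because the Translator object can be kept loaded between requests, most of the per-sentence cost is the translate_batch call itself, which is where the speed-up over the original framework shows.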

I'm well aware that the Neural MT Gateway offers translations of varying quality, which largely reflects the volume of data on which the models have been trained. We can't expect a dataset built on Jehovah's Witness documentation and little else to produce a model that delivers the same "near-human" quality on contemporary general texts as a Russian-English model trained on many millions of sentences. That does not mean our Ewe-English model would be useless in the field. The ability to translate a sentence like "Where is the nearest clean water?" is definitely useful.

Datasets can take the form of TMX files. I would be happy to receive datasets for African languages with a view to fine-tuning existing models. I would also be keen to talk to potential customers who are interested in developing in-house or on-premise MT solutions for low-resourced language pairs. I can be reached on [email protected]
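A TMX file is just XML, so turning one into the sentence pairs needed for fine-tuning takes only a few lines of Python with the standard library. A sketch (the function name is my own):

```python
import xml.etree.ElementTree as ET

# The xml:lang attribute lives in the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Extract (source, target) sentence pairs from a TMX document string.
    Language codes are matched on their primary subtag, so 'en-GB' matches 'en'."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):          # one translation unit per sentence pair
        segs = {}
        for tuv in tu.iter("tuv"):      # one variant per language
            lang = tuv.get(XML_LANG, "").lower().split("-")[0]
            seg = tuv.find("seg")
            if seg is not None and seg.text:
                segs[lang] = seg.text.strip()
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs
```

The resulting list of pairs can be written straight out as the parallel text files most fine-tuning pipelines expect.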

Thanks for taking the time to read this!



