5 Tools to Build Your Basic Machine Translation Toolkit – Part I

If you are a linguist working with Machine Translation (MT), your job will be a lot easier with the right tools at hand. A strong toolkit, and knowing how to use it, will save you loads of time and headaches, and help you work more efficiently.

 As a Machine Translation Language Specialist at eBay, I use these tools on a regular basis at work, and that is why I feel comfortable recommending them. At eBay, we use MT to translate search queries and listing titles and descriptions into several languages. If you want to learn more, I encourage you to read “The Next Big Thing You Missed: Why eBay, Not Google, Could Save Automated Translation”. 

1. Advanced Text Editors

Notepad won’t cut it, trust me. You need an advanced text editor that can, at least:

  • deal with different encodings (UTF-8, UTF-16, ANSI, etc.)
  • open big files, sometimes with unusual formats/extensions
  • do global search and replace operations with regular expression support
  • highlight syntax (display different programming, scripting or markup languages, such as XML or HTML, with color codes)
  • have multiple files open at the same time (tabs)
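
To make two of these requirements concrete, here is a minimal Python sketch of what an editor does behind the scenes when it converts a file's encoding or runs a regular expression search and replace. The function names and the default encodings are my own choices for illustration, not features of any particular editor.

```python
import re
from pathlib import Path


def convert_encoding(src: Path, dst: Path,
                     src_enc: str = "cp1252", dst_enc: str = "utf-8") -> None:
    """Read a text file in one encoding and save a copy in another."""
    # "cp1252" is the Windows code page that is often labeled "ANSI".
    text = src.read_text(encoding=src_enc)
    dst.write_text(text, encoding=dst_enc)


def replace_in_text(text: str, pattern: str, replacement: str) -> tuple[str, int]:
    """Regex search and replace; returns the new text and the number of replacements."""
    return re.subn(pattern, replacement, text)


if __name__ == "__main__":
    # Collapse runs of two or more spaces into a single space.
    cleaned, count = replace_in_text("Free  shipping   today", r" {2,}", " ")
    print(cleaned, count)
```

A good editor lets you do the same thing interactively, across all open files, without writing any code.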

This is a list of my favorite text editors, but there are a lot of good ones out there.

Notepad++: My editor of choice. You can open virtually any file with it, it’s really fast, and it keeps your files open in the editor even after you close and reopen it. You can easily search and replace in a file or in all open files, using regular expressions or just extended characters (control characters like \n or \t). It’s really easy to convert from/to different encodings and save all open files at once. You can also download different plugins, like spellcheckers, comparators, etc. It’s free and you can download it from here.

Sublime Text: This is another amazing editor, and a developers’ favorite. Personally, I find it great for writing scripts. You can do many cool things with it, like using multiple selections to change several instances of a word at once, splitting a selection of words into different lines, etc. It supports regular expressions and tabs, as well. It has a distraction-free mode if you really need to focus. You can evaluate it for free, although continued use requires a license, and you can get it here.

EmEditor: Syntax highlighting, document comparison, regular expressions, support for huge files, encoding conversion… EmEditor is extremely complete. My favorite feature, however, is the scriptable macros. This means you can create, record, and run macros within EmEditor, and use them to automate repetitive tasks, like making changes in several files and/or saving them with different extensions. You can download it from here.

2. QA Tools

Quality Assurance (QA) Tools help you automatically find different types of errors in translated content. They all work in basically the same way: 1) you load files with your translated content (source + target); 2) you optionally load reference content, like glossaries, translation memories, previously translated files or blacklists; 3) the tool checks your content and provides a report listing potential errors. Some of the errors you can find using a QA Tool are:

  • terminology: term A in the source is not translated as B in the target
  • blacklisted terms: terms you don’t want to see in the target
  • inconsistencies: same source segment with different translations
  • differences in numbers: source and target numbers should match
  • capitalization
  • punctuation: missing or extra periods, duplicate commas, etc.
  • patterns: certain user-defined patterns of words, numbers and signs that are expected to occur in a file; they may contain regular expressions to make them more flexible
  • grammar and spelling errors
  • duplicate words, tripled letters, and more.
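
To give an idea of how some of these checks work, here is a minimal Python sketch that flags number mismatches, repeated words, duplicate commas and blacklisted terms in a single source/target pair. It is only an illustration under simple assumptions: `check_segment` and the issue labels are names I made up, and real QA tools apply much more refined rules (for example, they can handle locale differences such as 1,000 vs. 1.000, which this sketch would report as a mismatch).

```python
import re
from collections import Counter

NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")           # 12, 3.5, 1,000
REPEATED_WORD_RE = re.compile(r"\b(\w+)\s+\1\b")      # "coche coche"
DUPLICATE_COMMA_RE = re.compile(r",\s*,")             # ",," or ", ,"


def check_segment(source: str, target: str,
                  blacklist: frozenset[str] = frozenset()) -> list[str]:
    """Return a list of potential issues found in one translated segment."""
    issues = []

    # Numbers in source and target should match (compared as plain strings).
    if Counter(NUMBER_RE.findall(source)) != Counter(NUMBER_RE.findall(target)):
        issues.append("number mismatch")

    # The same word typed twice in a row in the target.
    if REPEATED_WORD_RE.search(target):
        issues.append("repeated word")

    # Duplicate commas in the target.
    if DUPLICATE_COMMA_RE.search(target):
        issues.append("duplicate comma")

    # Terms we never want to see in the target (lowercase, single words).
    target_words = set(re.findall(r"\w+", target.lower()))
    for term in sorted(target_words & blacklist):
        issues.append(f"blacklisted term: {term}")

    return issues


if __name__ == "__main__":
    print(check_segment("Pack of 12 pens", "Paquete de 10 bolígrafos"))
```

Every check returns a potential error, not a confirmed one, which is why the tools present their findings as a report for a linguist to review.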

Some QA Tools you should try are:

Xbench allows you to run the following QA checks: untranslated segments; segments with the same source text but different target text; segments with the same target text but different source text; segments whose target text matches the source text (potentially untranslated text); tag mismatches; number mismatches; double blanks; repeated words; terminology mismatches against a list of key terms; and spelling errors.
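
The consistency checks in that list are easy to picture with a small Python sketch. This is not how Xbench is implemented, just my own simplified illustration of the idea: group the segments by source and by target, and report any group with more than one counterpart. The function name and the report keys are hypothetical.

```python
from collections import defaultdict


def find_inconsistencies(pairs: list[tuple[str, str]]) -> dict:
    """Group (source, target) pairs and report consistency issues."""
    targets_by_source = defaultdict(set)
    sources_by_target = defaultdict(set)
    target_equals_source = []

    for source, target in pairs:
        targets_by_source[source].add(target)
        sources_by_target[target].add(source)
        # Identical source and target may mean the segment was left untranslated.
        if source.strip() == target.strip():
            target_equals_source.append(source)

    return {
        "same_source_different_target": {
            s: sorted(t) for s, t in targets_by_source.items() if len(t) > 1
        },
        "same_target_different_source": {
            t: sorted(s) for t, s in sources_by_target.items() if len(s) > 1
        },
        "target_equals_source": target_equals_source,
    }


if __name__ == "__main__":
    sample = [
        ("Red shoes", "Zapatos rojos"),
        ("Red shoes", "Zapatos colorados"),
        ("Blue shirt", "Camisa azul"),
        ("Navy shirt", "Camisa azul"),
        ("iPhone", "iPhone"),
    ]
    for check, findings in find_inconsistencies(sample).items():
        print(check, findings)
```

Note that a target matching its source is not always an error: brand names like the one in the sample are often meant to stay untranslated.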

Some linguists like to add all their reference materials in Xbench, like translation memories, glossaries, termbases and other reference files, as the tool allows you to find a term while working on any other running application with just a shortcut.

Xbench also has an Internet Search tab to run searches on Google. The list of search sources is pretty limited, but there are ways to expand it. You can get Xbench here.

Checkmate is the QA Tool included in the Okapi Framework, an open source suite of applications that supports the localization process. The Framework includes other tools as well, but Checkmate is the one you want for performing quality checks on your files. It supports many bilingual file formats, like XLIFF, TTX, and TMX. Some of the checks you can run are: repeated words, corrupted characters, patterns, inline code differences, significant differences in length between source and target, missing translations, spaces, etc. The patterns section is especially interesting; I will come back to it in the future. Checkmate produces comprehensive error reports in different formats. It can also be integrated with LanguageTool, an open source spelling and grammar checker. You can find Checkmate here.
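
If you are curious about what checks such as missing translations or significant length differences involve, the following Python sketch reads a small XLIFF 1.2 document and flags both. It does not use Okapi or Checkmate in any way; the function name, the sample content and the 2.0 length ratio threshold are assumptions I picked for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by XLIFF 1.2 documents.
XLIFF_NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Tiny made-up sample file, for illustration only.
SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file source-language="en" target-language="es" datatype="plaintext" original="demo">
    <body>
      <trans-unit id="1"><source>Red shoes</source><target>Zapatos rojos</target></trans-unit>
      <trans-unit id="2"><source>Blue shirt</source><target></target></trans-unit>
      <trans-unit id="3"><source>Free shipping on all orders</source><target>Envío</target></trans-unit>
    </body>
  </file>
</xliff>"""


def _text(element) -> str:
    """Plain text of an element (inline codes ignored); empty if it is missing."""
    return "".join(element.itertext()).strip() if element is not None else ""


def check_xliff(xliff_text: str, max_ratio: float = 2.0) -> list[tuple[str, str]]:
    """Return (trans-unit id, issue) pairs for an XLIFF 1.2 document."""
    root = ET.fromstring(xliff_text)
    issues = []
    for unit in root.iterfind(".//x:trans-unit", XLIFF_NS):
        unit_id = unit.get("id", "?")
        source = _text(unit.find("x:source", XLIFF_NS))
        target = _text(unit.find("x:target", XLIFF_NS))
        if not target:
            issues.append((unit_id, "missing translation"))
        elif source:
            # Compare the longer text against the shorter one, in characters.
            longer, shorter = max(len(source), len(target)), min(len(source), len(target))
            if longer / shorter > max_ratio:
                issues.append((unit_id, "length difference"))
    return issues


if __name__ == "__main__":
    for unit_id, issue in check_xliff(SAMPLE):
        print(unit_id, issue)
```

A sensible length threshold depends on the language pair, so treat the ratio as something to tune rather than a fixed rule.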

Do you want to learn which other tools made it into the toolkit? Check out the second part of this post to find out!

If you enjoyed this article, please check other posts from the eBay MT Language Specialists series.
