5 Tools to Build Your Basic Machine Translation Toolkit – Part I
If you are a linguist working with Machine Translation (MT), your job will be a lot easier if you have the right tools at hand. A strong toolkit, and knowing how to use it, will save you a lot of time and headaches and help you work more efficiently.
As a Machine Translation Language Specialist at eBay, I use these tools on a regular basis at work, and that is why I feel comfortable recommending them. At eBay, we use MT to translate search queries and listing titles and descriptions into several languages. If you want to learn more, I encourage you to read “The Next Big Thing You Missed: Why eBay, Not Google, Could Save Automated Translation”.
1. Advanced Text Editors
Notepad won’t cut it, trust me. You need an advanced text editor that can, at least:
- deal with different encodings (UTF-8, UTF-16, ANSI, etc.)
- open big files, sometimes with unusual formats/extensions
- do global search and replace operations with regular expressions support
- highlight syntax (display different programming, scripting or markup languages, such as XML or HTML, with color codes)
- have multiple files open at the same time (tabs)
This is a list of my favorite text editors, but there are a lot of good ones out there.
Notepad++: My editor of choice. You can open virtually any file with it, it’s really fast, and it keeps your files in the editor even after you close it. You can easily search and replace in one file or in all open files, using regular expressions or just extended characters (control characters like \n or \t). It’s really easy to convert between encodings and to save all open files at once. You can also download plugins, such as spellcheckers and file comparison tools. It’s free and you can download it from here.
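To make the search-and-replace and encoding conversion described above more concrete, here is a minimal sketch of the same kind of operation done in a script rather than in an editor. It assumes the input is a Windows-1252 ("ANSI") file with Windows line endings; the function name `convert_and_replace` and the sample text are hypothetical, and this is not how Notepad++ itself works internally.

```python
import re

def convert_and_replace(raw: bytes, source_encoding: str,
                        pattern: str, replacement: str) -> bytes:
    """Decode raw bytes, apply one regex replacement, re-encode as UTF-8."""
    text = raw.decode(source_encoding)
    text = re.sub(pattern, replacement, text)
    return text.encode("utf-8")

# Hypothetical sample: two lines saved as Windows-1252 with CRLF line endings.
sample = "café\r\nniño\r\n".encode("cp1252")

# Normalize CRLF to LF and convert the encoding in one pass.
result = convert_and_replace(sample, "cp1252", r"\r\n", "\n")
print(result.decode("utf-8"))
```

A script like this is handy when the same cleanup has to be repeated on many files; for a one-off change, the editor’s own dialog is usually quicker.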
Sublime Text: This is another amazing editor, and a developers’ favorite. Personally, I find it great for writing scripts. You can do many cool things with it, like using multiple selections to change several instances of a word at once or splitting a selection of words into separate lines. It supports regular expressions and tabs as well, and it has a distraction-free mode if you really need to focus. You can evaluate it for free, although continued use requires a license, and you can get it here.
EmEditor: Syntax highlighting, document comparison, regular expressions, support for huge files, encoding conversion… EmEditor is extremely complete. My favorite feature, however, is the scriptable macros: you can create, record, and run macros within EmEditor and use them to automate repetitive tasks, like making changes in several files and/or saving them with different extensions. You can download it from here.
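The kind of repetitive task mentioned above (changing several files and saving them under a different extension) can also be sketched as a standalone script. This is only an illustration in Python, not an EmEditor macro; the function `batch_rewrite`, the file names, and the `.clean` extension are all hypothetical.

```python
import re
import tempfile
from pathlib import Path

def batch_rewrite(folder, old_ext, new_ext, pattern, replacement):
    """Apply one regex replacement to every old_ext file in folder and
    save each result next to the original under new_ext."""
    written = []
    for path in sorted(Path(folder).glob(f"*{old_ext}")):
        text = path.read_text(encoding="utf-8")
        new_path = path.with_suffix(new_ext)
        new_path.write_text(re.sub(pattern, replacement, text), encoding="utf-8")
        written.append(new_path)
    return written

# Demo in a temporary folder with two made-up files containing double spaces.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "titles_es.txt").write_text("iPhone  6  nuevo\n", encoding="utf-8")
    (Path(tmp) / "titles_fr.txt").write_text("iPhone  6  neuf\n", encoding="utf-8")
    outputs = batch_rewrite(tmp, ".txt", ".clean", r" {2,}", " ")
    names = [p.name for p in outputs]
    contents = [p.read_text(encoding="utf-8") for p in outputs]

print(names)
print(contents)
```

The originals are left untouched and the cleaned copies get the new extension, which makes it easy to compare the two versions afterwards.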
2. QA Tools
Quality Assurance (QA) tools help you automatically find different types of errors in translated content. They all work in basically the same way: 1) you load files with your translated content (source + target); 2) you optionally load reference content, like glossaries, translation memories, previously translated files or blacklists; 3) the tool checks your content and produces a report listing potential errors. Some of the errors you can find using a QA tool are:
- terminology: term A in the source is not translated as B in the target
- blacklisted terms: terms you don’t want to see in the target
- inconsistencies: same source segment with different translations
- differences in numbers: source and target numbers should match
- capitalization
- punctuation: missing or extra periods, duplicate commas, etc.
- patterns: user-defined patterns of words, numbers and signs that are expected to occur in a file; they may contain regular expressions to make them more flexible
- grammar and spelling errors
- duplicate words, tripled letters, and more.
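To give an idea of what a few of the checks listed above involve, here is a minimal sketch in Python. It is a simplified illustration, not how any particular QA tool works: the function names, the glossary and blacklist formats, and the sample segments are all hypothetical, and the number check compares digit strings literally, so locale formatting differences (1,000 vs. 1.000) would be flagged.

```python
import re
from collections import defaultdict

NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
REPEATED = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)

def check_segment(source, target, glossary=None, blacklist=()):
    """Return a list of potential issues for one source/target pair."""
    issues = []
    # Numbers: the same numbers should appear on both sides.
    if sorted(NUMBER.findall(source)) != sorted(NUMBER.findall(target)):
        issues.append("number mismatch")
    # Duplicate words: the same word twice in a row in the target.
    if REPEATED.search(target):
        issues.append("repeated word")
    # Blacklisted terms: terms that must not appear in the target.
    for term in blacklist:
        if term.lower() in target.lower():
            issues.append(f"blacklisted term: {term}")
    # Terminology: term A in the source should be translated as B.
    for src_term, tgt_term in (glossary or {}).items():
        if src_term.lower() in source.lower() and tgt_term.lower() not in target.lower():
            issues.append(f"terminology: {src_term} -> {tgt_term}")
    return issues

def find_inconsistencies(pairs):
    """Return source segments that have more than one distinct translation."""
    by_source = defaultdict(set)
    for source, target in pairs:
        by_source[source].add(target)
    return {s: t for s, t in by_source.items() if len(t) > 1}

print(check_segment("Pack of 12 pens", "Paquete de 10 bolígrafos"))
print(check_segment("Leather case", "Estuche de cuero", glossary={"case": "funda"}))
```

Real tools add tokenization, locale-aware number handling and morphology on top of this, which is why their reports list potential errors that still need a human review.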
Some QA Tools you should try are:
Xbench allows you to run the following QA checks: untranslated segments; segments with the same source text but different target text; segments with the same target text but different source text; segments whose target text matches the source text (potentially untranslated text); tag mismatches; number mismatches; double blanks; repeated words; terminology mismatches against a list of key terms; and spelling errors.
Some linguists like to add all their reference materials to Xbench, like translation memories, glossaries, termbases and other reference files, as the tool lets you look up a term from any other running application with just a shortcut.
Xbench also has an Internet Search tab to run searches on Google. The default list of search sources is pretty limited, but there are ways to expand it. You can get Xbench here.
Checkmate is the QA tool of the Okapi Framework, an open source suite of applications that support the localization process. The Framework includes other tools, but Checkmate is the one you want for running quality checks on your files. It supports many bilingual file formats, like XLIFF, TTX, and TMX. Some of the checks you can run are: repeated words, corrupted characters, patterns, inline code differences, significant differences in length between source and target, missing translations, spaces, etc. The patterns section is especially interesting; I will come back to it in the future. Checkmate produces comprehensive error reports in different formats. It can also be integrated with LanguageTool, an open source spelling and grammar checker. You can find Checkmate here.
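Two of the checks mentioned above, length differences and inline code differences, are easy to picture with a small sketch. The Python below is only an illustration under my own assumptions: the thresholds, the simple HTML-like tag pattern, and the function names are hypothetical and do not reflect Checkmate’s actual implementation or settings.

```python
import re

# Assumption: inline codes look like simple HTML/XML tags such as <b> or </b>.
TAG = re.compile(r"</?[A-Za-z][^>]*>")

def length_ratio_issue(source, target, low=0.5, high=2.0):
    """Flag targets that are much shorter or longer than their source."""
    if not source:
        return False
    ratio = len(target) / len(source)
    return ratio < low or ratio > high

def tag_mismatch(source, target):
    """Flag segments whose inline tags differ between source and target."""
    return sorted(TAG.findall(source)) != sorted(TAG.findall(target))

print(length_ratio_issue("Brand new leather wallet", "Cartera"))
print(tag_mismatch("Click <b>here</b>", "Haga clic <b>aquí"))
```

Sensible length thresholds depend on the language pair, so values like these would need tuning before they are useful on real content.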
Do you want to learn which other tools made it to the toolkit? Check out the second part of this post to find out!
If you enjoyed this article, please check other posts from the eBay MT Language Specialists series.