On MT post-editing and TER
Post-editing machine-translated texts is a challenging task that requires intense effort.
The quality of raw MT output varies a lot, and the level of quality expected from post-editors also differs from project to project. For this reason, it is difficult to predict how long it will take to post-edit an automated translation to make it publishable (or suitable for its intended use, whatever that is). As a consequence, how to price machine translation post-editing has always been a hot topic.
One of the methods developed to help predict post-editing effort is Translation Edit Rate (TER), also known as Translation Error Rate.
WHAT IS TER?
TER measures the amount of editing that a linguist would need to perform for a translation to match a reference (human) translation. When this analysis is repeated on different sample translations, it is possible to estimate the post-editing effort required. It is important to stress that the effort is measured purely in terms of the quantity of changes.
A TER score can be expressed as a value between 0 and 1, but it is more frequently given as a percentage. The lower the score, the better (i.e. the fewer edits are required); a high TER score means that a translation will need more changes during post-editing.
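To make the idea concrete, here is a minimal Python sketch of a TER-like score based on word-level edit distance. Full TER (Snover et al., 2006) also counts block "shifts" of word sequences as single edits; this simplified version only counts insertions, deletions and substitutions, normalized by the reference length.

```python
# Minimal sketch of a TER-like score (insertions, deletions, substitutions only;
# real TER also counts word-sequence "shifts" as single edits).

def edit_distance(hyp_words, ref_words):
    """Word-level Levenshtein distance between two token lists."""
    rows, cols = len(hyp_words) + 1, len(ref_words) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if hyp_words[i - 1] == ref_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[rows - 1][cols - 1]

def ter_like_score(mt_output, reference):
    """Edits needed to turn the MT output into the reference,
    normalized by the reference length (0.0 = identical)."""
    hyp, ref = mt_output.split(), reference.split()
    if not ref:
        return 0.0
    return edit_distance(hyp, ref) / len(ref)

mt = "the cat sat in mat"
human = "the cat sat on the mat"
print(f"TER-like score: {ter_like_score(mt, human):.0%}")  # 33%
```

In this toy example, turning the MT output into the reference takes one substitution and one insertion, so the score is 2 edits over 6 reference words, i.e. roughly 33%.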
HOW TER IS USED
As explained, the TER evaluation method was created to predict post-editing effort. However, many big MT players (including big post-editing buyers) have adopted the same approach to count the changes made during post-editing, using it retrospectively rather than as a predictive method.
To serve those buyers, some CAT tools now embed a post-editing analysis, with the ultimate objective of making post-editing pricing easier and more democratic.
The assumption is simple: the fewer the edits made, the less effort was required from the post-editor, and the post-editing rate can therefore be adjusted (i.e. reduced) accordingly.
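To illustrate how this plays out in pricing, here is a small and purely hypothetical sketch: the discount bands and multipliers are invented for the example and do not come from any real CAT tool or rate card.

```python
# Hypothetical discount bands (NOT from any real CAT tool or rate card):
# the lower the measured post-editing distance, the larger the discount
# applied to the full per-word translation rate.

PE_DISCOUNT_BANDS = [
    (0.10, 0.30),   # up to 10% of words changed -> pay 30% of the full rate
    (0.30, 0.60),   # up to 30% changed          -> pay 60% of the full rate
    (1.00, 1.00),   # heavier editing            -> pay the full rate
]

def post_editing_rate(full_rate_per_word, edit_distance_ratio):
    """Return the per-word rate for a segment, given the share of words edited."""
    for threshold, multiplier in PE_DISCOUNT_BANDS:
        if edit_distance_ratio <= threshold:
            return full_rate_per_word * multiplier
    return full_rate_per_word

print(post_editing_rate(0.10, 0.05))  # lightly edited segment -> 0.03
print(post_editing_rate(0.10, 0.45))  # heavily edited segment -> 0.10
```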
THE FUTURE OF TER
Ask any post-editor, seasoned or not, and you will be told that post-editing requires a lot of effort.
1) Cognitive effort
But does TER (or rather, the number of edits) really reflect the post-editor's cognitive effort? Simply put, cognitive effort is the effort related to the mental processes that take place during post-editing, and it is not necessarily proportional to the technical effort. Cognitive decisions do not always result in edits (for instance, when the MT output is left unchanged), and, conversely, not all edits require much cognitive effort.
Also, the quantity of edits is only part of the issue, as a single edit might be more critical than several minor ones. The quality of the edits also needs to be taken into account.
For this reason, evaluating the quality of raw machine translation output by the number of edits it requires fails to capture the actual effort involved in the process.
2) Neural MT and beyond
Machine Translation has improved enormously in the last four years, since Google rolled out Neural Machine Translation at scale in 2016. With this technology, many of the typical issues of MT are largely solved, including lack of fluency, grammar mistakes and agreement errors. Depending on several factors (source quality, language pair and vertical, to mention a few), the quality of NMT output can be very high, with only a few edits (if any) needed.
However, the high fluency of texts translated with neural MT can be misleading and can hide more serious mistakes (for instance, mistranslations, missing words or wrong terminology). A single edit in a sentence might take longer to spot than to actually make and, again, it requires considerable cognitive effort. Similarly, leaving a sentence unchanged (which is becoming more and more frequent) does not mean that the post-editor did not have to read the target sentence and compare it with the source text. TER does not account for the actual time spent in the process.
3) The TER bias
No localization project is the same as the previous one, as we all know, and with MT there are even more variables. What level of quality is required? Will the text be published, circulated internally, or does it only serve "gisting" purposes? What is the quality of the source text? Depending on the answers to these and many other questions, post-editing can involve more or fewer edits. For instance, if a text is only needed for gisting, typos will not be fixed; if it is going to be published, more edits will be required.
In a scenario where TER is used to establish the post-editor's compensation, and more edits mean higher pay, how can we be sure that the post-editor will not make extra changes that are not really necessary or, worse, that go beyond and against the scope of the work? This could be unconscious, of course, but the problem still stands.
POST-EDITING DISTANCE, REVERSED
The use of Machine Translation is increasing every day and many localization buyers are adopting this technology.
However, not all localization projects go through MT, for various reasons, including fear of the technology, privacy concerns or the intended use of the translated text. For instance, when the translation will be used to train a Machine Translation engine, human translation is usually the best way to go.
Easy access to MT through integrations, together with the low cost of the technology, has prompted localization buyers to look for methods of spotting the unauthorized use of MT in standard (human) translation projects.
One such method is to compare the human translation with a reference machine translation, usually one that is freely available (such as Google NMT), and to establish how much the human translation differs from it. This is the same idea as TER, but with the roles inverted: the assumption is that a translation produced by a professional linguist has to be substantially different from one generated by a computer.
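A minimal sketch of this reversed check follows, again with invented values rather than any vendor's actual tool: it compares a delivered translation against a stock MT reference using Python's standard difflib and flags segments that are almost identical to it. The threshold is arbitrary and for illustration only.

```python
# Sketch of the "reversed" check: flag delivered segments that are
# suspiciously close to a stock MT reference. Threshold is illustrative.
import difflib

delivered = "the cat sat on the mat"
stock_mt  = "the cat sat on the mat"

similarity = difflib.SequenceMatcher(
    None, delivered.split(), stock_mt.split()
).ratio()

if similarity > 0.90:  # arbitrary threshold, for illustration only
    print(f"{similarity:.0%} similar to stock MT - flagged as possible MT use")
else:
    print(f"{similarity:.0%} similar to stock MT")
```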
But for simple texts, and with the quality of stock NMT engines improving day after day, this difference becomes less and less evident.
Still, we have had projects where customers claimed we used machine translation because "your translation and the one from Google Translate are pretty much the same".
How should we deal with these situations? Is it really a good idea to force your provider to edit the translation just to make it "as different as possible from Google"? Strange as it may sound, it has happened.
CONCLUSIONS
Machine Translation has been around and accessible for many years now, but many questions about it remain open on many levels. TER is certainly one of them.
Do you have any experience with TER that you can share? Any success story or big fail? I will be happy to hear from you!