Why you should not base your workflow process decisions on any segment-level score (including Phrase’s new QPS)
An attempt to understand translation quality by assigning segment-level scores.

I watched the recent video presentation of the Quality Performance Score (QPS) from Phrase with great interest, and it raised some pertinent questions that I feel compelled to share.

I did like the fact that MQM (Multidimensional Quality Metrics) was mentioned 52 times during this presentation. After all, my entire research in the field of translation quality evaluation has been centered around MQM ever since my participation in the QTLaunchPad project in 2012. It's great to see that the long-standing, continuous effort of outstanding volunteer experts is now recognized as an invaluable tool by Google Research and other key industry players, including Phrase. I am also happy that the vision I formulated in 2020, that translation quality evaluation would become more, not less, important with the advent of AI, is indeed becoming a reality.

But two things nagged at me: the fact that QPS is only very indirectly built on MQM, and of course my understanding that such segment-level evaluations cannot be accurate, stable, or reliable. Let's have a look at why.

From the presentation, it appears that QPS is a segment-level direct estimate of quality on a scale of 0 to 100, produced by an AI model trained on some data. In NLP, this scoring method is called Direct Assessment (DA), and it has been around for a while. In Direct Assessment, humans rate each segment of the output from an MT system with an absolute score or label. The method has been in use since the WMT16 challenge in 2016.
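
To make the mechanics of DA aggregation concrete: raw 0-100 ratings are not directly comparable across annotators (some raters are systematically harsher than others), so in the WMT campaigns they are typically standardized per annotator before being averaged into segment scores. The Python sketch below is a minimal illustration of that standardization step, with invented numbers; it is not Phrase's pipeline, nor the exact WMT tooling.

```python
from statistics import mean, stdev

# Invented example: raw 0-100 DA ratings, keyed by (annotator, segment_id).
raw_scores = {
    ("ann1", "seg1"): 85, ("ann1", "seg2"): 60, ("ann1", "seg3"): 90,
    ("ann2", "seg1"): 55, ("ann2", "seg2"): 30, ("ann2", "seg3"): 70,
}

# Group ratings by annotator so each rater's strictness can be factored out.
by_annotator = {}
for (annotator, seg), score in raw_scores.items():
    by_annotator.setdefault(annotator, []).append(score)

stats = {a: (mean(v), stdev(v)) for a, v in by_annotator.items()}

# Convert each rating to a z-score relative to its annotator's own distribution.
z_scores = {}
for (annotator, seg), score in raw_scores.items():
    mu, sigma = stats[annotator]
    z_scores.setdefault(seg, []).append((score - mu) / sigma)

# The segment-level DA score is the average of the z-scores the segment received.
segment_da = {seg: mean(vals) for seg, vals in z_scores.items()}
print(segment_da)
```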

It bears noting that initially, the humans for DA in the WMT challenges were sourced from Amazon's Mechanical Turk crowdsourcing platform. I remember being both horrified and amused when I learned that NLP researchers based their "human parity" claims on DA ratings from Mechanical Turk.

It is common knowledge in any human activity that the quality of an evaluation depends on the reviewer's qualifications. If you are not a mechanical engineer, you would not be able to evaluate the quality of another mechanical engineer's work. If you are not a lawyer, you cannot judge legal work. If you are not a medical professional, you are unable to assess the quality of professional medical services or advice. I could never understand how people fail to see that the same applies to language and translation. Clearly, the basic premise of proper translation quality evaluation is that it should be conducted by a qualified linguist.

But there's much more to it: holistic DA at the segment level is problematic in and of itself.

For example, you may be asked to rate a movie on a scale of 0 to 5 based on how much you liked it overall. Naturally, in this type of assessment you are not presented with the whole universe of criteria that critics operate on: you just provide your overall impression.

The strength of such “holistic” assessments is that they look simple and uniform. The holistic approach can be quite powerful when applied to large and complex samples. But when it’s applied to a particular segment and not the entire sample, its weaknesses prevail:

- The holistic approach loses a lot of details that could be very important for evaluation results.
- It is much more random than analytical assessments.
- It is much less stable and consistent.
- It does not take into account the context of the previous and following sentences.
- It is therefore much less precise than even the analytical segment-level assessments.

But let me explain here what analytical segment-level assessment is.

In 2021, Markus Freitag from Google et al. published a paper entitled "Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation" [1]. In this paper, an MQM-based segment-level assessment referred to as SQM was pioneered, and the resulting data was later used for the WMT 2020 metrics task.

The method works as follows: annotators go through the segments and annotate errors using the MQM error typology. Afterwards, a segment-level score similar to a DA score is derived using a scoring formula explained in the paper.
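
To make this annotate-then-score pipeline concrete, here is a minimal Python sketch of turning MQM error annotations into a segment-level penalty. The severity weights loosely follow the scheme described in [1] (major errors weighted far more heavily than minor ones, with special cases for non-translation and minor punctuation issues); they are illustrative only, not Phrase's or Google's exact formula.

```python
from dataclasses import dataclass

# Illustrative severity weights, loosely following the MQM scoring scheme in [1];
# real deployments tune these per content type and project.
WEIGHTS = {
    ("non-translation", "major"): 25.0,
    ("any", "major"): 5.0,
    ("fluency/punctuation", "minor"): 0.1,
    ("any", "minor"): 1.0,
    ("any", "neutral"): 0.0,
}

@dataclass
class ErrorAnnotation:
    category: str   # MQM category, e.g. "accuracy/mistranslation"
    severity: str   # "major", "minor", or "neutral"

def error_weight(err: ErrorAnnotation) -> float:
    """Look up the penalty for one annotated error, falling back to the generic weight."""
    return WEIGHTS.get((err.category, err.severity), WEIGHTS[("any", err.severity)])

def segment_penalty(errors: list[ErrorAnnotation]) -> float:
    """Sum of error penalties for one segment; higher means worse."""
    return sum(error_weight(e) for e in errors)

# Invented example segment with two annotated errors.
errors = [
    ErrorAnnotation("accuracy/mistranslation", "major"),
    ErrorAnnotation("fluency/punctuation", "minor"),
]
print(segment_penalty(errors))  # 5.1
```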

Such an SQM metric can, with some stretch, be called an "MQM-based" metric, because at least the error annotation was done in accordance with the MQM typology.

IF you do an analytical annotation of errors first, and THEN calculate the segment score with some sort of scoring formula based on segment penalty points, this can be called an MQM-based metric, although not a complete one.

It is not a full MQM-based metric, because MQM is about assigning an MQM score to a sample, not to an individual translation unit! Individual segment-level scores make little sense statistically, in principle, because of their low reliability, which is caused by significant variance in annotator judgments [5].

There are two very important reasons for that:

(a) Sentence-level scoring is not precise or reliable in principle, due to the statistical nature of errors and the variance of annotator judgments [7]. It has been shown that human error annotations may vary significantly for a plethora of reasons [7]. That is why, statistically, information about errors in a sample of less than 1,000 words is not reliable [6] (see the sketch after point (b) below). Naturally, a model trained on such data is even less reliable as far as segment-level scoring is concerned. This is fundamental and cannot be improved by using a bigger model.

(b) Sentence-level scoring misses even the most proximal context completely, and there are many sentences that can have very different translations depending on their context.
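
To put numbers on point (a): the uncertainty of an error-rate estimate depends heavily on how much text was reviewed. The sketch below (invented figures, simple normal-approximation confidence interval) shows how wide the interval around an errors-per-1,000-words estimate is for a single short sentence versus a 1,000-word sample, which is the intuition formalized in [6].

```python
import math

def error_rate_ci(errors_found: int, words_reviewed: int, z: float = 1.96):
    """Approximate 95% confidence interval for the error rate,
    expressed as errors per 1,000 words. Purely illustrative."""
    p = errors_found / words_reviewed
    half_width = z * math.sqrt(p * (1 - p) / words_reviewed)
    low, high = max(0.0, p - half_width), p + half_width
    return low * 1000, high * 1000

# Same underlying quality (~20 errors per 1,000 words), different sample sizes.
for errors_found, words_reviewed in [(1, 20), (2, 100), (20, 1000), (200, 10000)]:
    low, high = error_rate_ci(errors_found, words_reviewed)
    print(f"{words_reviewed:>6} words: {low:6.1f} .. {high:6.1f} errors/1000 words")
```

With 20 words the interval spans from 0 to well over 100 errors per 1,000 words, i.e. the single-segment estimate tells us almost nothing; at 1,000 words and beyond it finally narrows to something actionable.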

BUT if you don't even do analytical error annotation and simply assign the segment a score, it is not an MQM-based metric, it is not a reliable score, and it will not be accurate regardless of how you obtain it: from human evaluation, from AI, or from another type of language model.

The presenters from Phrase hinted that the score is obtained from a non-GenAI language model pre-trained on human evaluations from historical data.

Well, first of all, if that data was “MQM segment-level scores,” then the training data is not reliable in the first place for the reasons explained above.

Second, unlike human evaluators, an AI model cannot capture all errors. Humans see more errors than any automatic AI metric does, and therefore such a metric will inevitably inflate quality scores compared even with human segment-level evaluation, as we clearly demonstrated in our extensive work [3].

Third, the accuracy, reliability, and stability of such predictions are evidently low in principle. That said, it is probably more reliable and accurate than zero-shot direct assessment from GenAI (GEMBA-SQM, implemented in our Perfectionist TQE tool [4]), but this has yet to be measured.

And here comes the final point: even though everything related to AI, in its many forms, typically becomes part of the media hype cycle, it is first and foremost about research. And considering the many possible implications of anything AI-related for the world as we know it today, we want the results of that research to be reliable and responsible, free of the poorly substantiated claims that the media so loves.

AI and NLP desperately need proper benchmarks and verifiable transparency, not unsubstantiated claims and process decisions made on the basis of unreliable scoring.

The language industry needs research and implementations that are transparent and backed by rigorous science and mathematics. It needs published research that discloses the language model used and lets us examine the training dataset and samples of the data. This would enable us to reproduce the results, test their veracity, accuracy, and reliability, and be confident in them.

The method that Phrase is using is similar to COMET [2]. But while we know that COMET has a great idea behind it, it requires real implementation rigor to be truly trustworthy and reliable for concrete applications, and it remains an indirect automatic metric that is not fully equivalent to human judgment [3].
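
For readers who want to see what such a learned segment-level metric looks like in practice, here is a sketch of scoring a segment with an open COMET checkpoint. It assumes the unbabel-comet Python package and the publicly released wmt22-comet-da model; verify function names and model identifiers against the current COMET documentation before relying on them.

```python
# pip install unbabel-comet   (assumed package name; check the COMET docs)
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

# Each item pairs a source segment, its machine translation, and a reference.
data = [
    {"src": "Der Vertrag wurde gestern unterzeichnet.",
     "mt":  "The contract was signed yesterday.",
     "ref": "The agreement was signed yesterday."},
]

# COMET returns per-segment scores plus an aggregate system score; as argued
# above, the per-segment numbers should be treated as rough indicators, not
# as grounds for per-segment workflow decisions.
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)        # segment-level scores
print(output.system_score)  # aggregate over the sample
```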

To conclude:

- Segment-level direct assessment prediction is not accurate, reliable, or stable.
- Segment-level DA prediction misses even the immediate context.
- MQM is not just a typology; it is a typology plus a sample-based (not segment-based) scoring model, and its statistical reliability comes from the size of the sample.
- Predicting scores with an AI model leads to the same problem as with other automated metrics: it inflates the quality score compared to human judgment.
- A segment-level SQM score based on MQM annotation is not a reliable set of data for AI training.

Therefore, it's a bit premature process-wise to base project management decisions on Phrase QPS scores.

I would definitely not recommend following this route except in a very narrow set of cases, and even that applicability scope is currently unclear.

For better or for worse, sample-based human evaluation built on a truly analytical approach remains the only reliable gold standard for translation quality evaluation.

References

[1] Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation, Markus Freitag et al., 29 April 2021, https://arxiv.org/pdf/2104.14478.pdf

[2] The Inside Story: Towards Better Understanding of Machine Translation Neural Evaluation Metrics, Ricardo Rei, Nuno M. Guerreiro, Marcos Treviso, Alon Lavie, Luisa Coheur, André F. T. Martins, 19 May 2023, https://arxiv.org/pdf/2305.11806.pdf

[3] Neural Machine Translation of Clinical Text: An Empirical Investigation into Multilingual Pre-Trained Language Models and Transfer-Learning, Lifeng Han, Serge Gladkoff, Gleb Erofeev, Irina Sorokina, Betty Galiano, Goran Nenadic, https://www.frontiersin.org/journals/digital-health/articles/10.3389/fdgth.2024.1211564/abstract

[4] GEMBA-SQM translation quality evaluation is easy to implement as zero-shot LLM prompt … and totally useless, Serge Gladkoff, https://ai-lab.logrusglobal.com/gemba-sqm-translation-quality-evaluation-is-easy-to-implement-as-zero-shot-llm-prompt-and-totally-useless/

[5] Assessing Inter-Annotator Agreement for Translation Error Annotation, Arle Lommel, Maja Popović, Aljoscha Burchardt, DFKI, https://www.dfki.de/fileadmin/user_upload/import/7445_LREC-Lommel-Burchardt-Popovic.pdf

[6] Measuring Uncertainty in Translation Quality Evaluation (TQE), Serge Gladkoff, Irina Sorokina, Lifeng Han, Alexandra Alekseeva, https://arxiv.org/abs/2111.07699

[7] Student's t-Distribution: On Measuring the Inter-Rater Reliability When the Observations are Scarce, Serge Gladkoff, Lifeng Han, Goran Nenadic, https://arxiv.org/abs/2303.04526

Ala Uddin

Experts in making websites for business owners | Your development partner | Generate 5X more revenue with a high-converting website | Sr. Software Engineer | Founder @KodeIsland.

4 months

Serge, thanks for sharing!

Craig Stewart

Director of AI Research at Phrase

1 year

Automated evaluation at the segment level is not the holy grail; emerging technologies like GenAI offer opportunities to move in the direction of granular, document-level evaluation. We share an interest in keeping MQM at the forefront of our research and I'm extremely excited to see how we can get somewhere closer in the near future to the panacea that you describe. QPS is a small piece in a much bigger puzzle; it sits in our product as a complement to lots of other checks and balances including human evaluation. We'll continue to build on it, explore other use cases for it and drive the technology towards continually more robust and interpretable evaluation.

Craig Stewart

Director of AI Research at Phrase

1 year

Hi Serge, Great article! You make a lot of good points that mirror many of the known challenges in automated evaluation; I sit on the committee for the WMT Metrics Shared task (with Markus Freitag ) and these are problems we discuss at every single meetup. Segment-level automated evaluation is indeed not the silver bullet to adequately representing translation quality but is rather one of the tools in the toolbox that can allow broad, high-level oversight. You are absolutely right that for the moment, expert human evaluation at the document level is as close as we can get to understanding translation quality. The challenge that automated, segment-level scoring systems (QPS, COMET and many, many others) seek to address is that human evaluation is not scalable. The billions of words that flow through machine translation pipelines in particular, are a blindspot to most. Sure, there are cases in which the automated system will miss the mark but in most cases I would much prefer to have some insight than none at all.

David Turnbull

Experienced IT-EN translator with interests in MT, AI and Quality Management

1 year

Thanks for this very interesting piece. I was on that webinar too and am a long-term frequent user of Phrase. I agree with the limitations you point out in your post (What are these scores actually based on? How are we supposed to trust them?) and the prematurity of relying on them to make process decisions. Furthermore, the overnight introduction of this new and highly granular system, which provides a falsely inflated sense of confidence as to the quality score attributed to MT (what does 92/100 mean anyway?), has real-world impacts on the post-editor's approach and, potentially, their remuneration where the QPS is tied to a pre-existing net rate scheme. Phrase has said QPS is a work in progress and we may be able to customise it with weightings per client or text type in future, at which point the scoring becomes pretty arbitrary and subjective anyway. In my opinion, assigning anything other than a broad indicator as to the possible quality of the MT (especially on a segment level) is misleading and potentially degrades the work and conditions of translators.
