Super-Human Translation?

Can Large Language Models Help Improve Translation Quality at Scale?

In a study released at the end of February, Microsoft engineers Tom Kocmi and Christian Federmann set out to test whether large language models (LLMs) can be used to evaluate translation quality.

The researchers acknowledge that both humans and large language models have their own strengths and weaknesses in evaluating translation quality. People are generally still better at understanding context, idiomatic expressions, and cultural nuances, all of which are important for accurately conveying the intended meaning of the source text. They can also identify translation errors that may not be obvious to an automated system.

Large language models, on the other hand, can process and evaluate translations much faster than humans, making them more suitable for handling large volumes of text. They can also provide more consistent evaluations since they are not subject to human biases or fatigue.

According to the researchers, large language models such as GPT have inherent multilingual capabilities that enable translation between languages even without fine-tuning for the task. However, while GPT-based translation achieves high quality for high-resource languages, it falls short for underrepresented languages.

Inspired by this, the researchers explore the use of LLMs for automated translation quality assessment, aiming to differentiate between good and bad translations. To address this, they propose GEMBA (GPT Estimation Metric Based Assessment), a metric that evaluates individual segment translations and averages the scores for a system-level assessment.
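To make the idea concrete, here is a minimal sketch of how such a metric could be wired together. The prompt wording is a paraphrase of the approach described above, not the paper's exact template, and query_llm stands in for whichever LLM client is actually used; the optional reference argument corresponds to the two evaluation modes discussed further below.

```python
# Minimal sketch of a GEMBA-style scorer. The prompt text paraphrases the idea
# described in the article; it is not the exact template from the paper, and
# query_llm is a placeholder for whatever LLM client is available.

def build_prompt(source, hypothesis, src_lang, tgt_lang, reference=None):
    """Build a single-segment scoring prompt, with or without a human reference."""
    lines = [
        f"Score the following translation from {src_lang} to {tgt_lang} on a "
        "scale from 0 to 100, where 0 means no meaning is preserved and 100 "
        "means a perfect translation.",
        f'{src_lang} source: "{source}"',
    ]
    if reference is not None:  # reference-based mode; omit for reference-free mode
        lines.append(f'{tgt_lang} human reference: "{reference}"')
    lines.append(f'{tgt_lang} translation: "{hypothesis}"')
    lines.append("Score:")
    return "\n".join(lines)


def system_score(segments, src_lang, tgt_lang, query_llm):
    """Score each segment, then average into a single system-level score."""
    scores = []
    for seg in segments:
        prompt = build_prompt(seg["source"], seg["hypothesis"],
                              src_lang, tgt_lang, seg.get("reference"))
        reply = query_llm(prompt)            # hypothetical LLM call
        scores.append(float(reply.strip()))  # assumes the model returns a bare number
    return sum(scores) / len(scores)
```

Averaging the per-segment scores is what produces the system-level number reported in the study; segment-level reliability is a separate question, discussed below.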

The study demonstrates state-of-the-art performance on the MQM 2022 test set for three language pairs: English to German, English to Russian, and Chinese to English.

The study defines and evaluates various prompt variants for zero-shot translation quality assessment in two modes: with or without a human reference translation.

Zero-shot prompting is a technique in natural language processing that enables a model to make predictions about previously unseen data without the need for any additional training. In other words, it allows the model to generalize to new tasks without being explicitly trained on them. A zero-shot prompt works by giving a prompt or input to a language model, which then generates an output or completion without any further training.

In a zero-shot prompt, the input context alone guides the language model's output: the model generates a response or completion consistent with the prompt, without any additional training. This approach contrasts with traditional machine learning techniques that require a large amount of labeled training data to make accurate predictions.

Zero-shot prompting can be useful for a variety of tasks such as classification, summarization, and question answering. It can also be extended to more complex tasks like chain-of-thought prompting, in which the model produces a coherent sequence of intermediate reasoning steps from a single prompt. However, it is important to note that while prompt-based zero-shot learning can achieve good results in a fully unsupervised setting, it does not necessarily outperform its supervised counterpart.
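As a purely illustrative example (the task and review text are invented), a zero-shot prompt states the task in the instruction alone and supplies no labeled examples:

```python
# A zero-shot prompt: the instruction alone defines the task; no labeled
# examples and no task-specific training are provided.
zero_shot_prompt = (
    "Classify the sentiment of the following customer review as Positive, "
    "Negative, or Neutral.\n"
    'Review: "The delivery was late, but the product itself works perfectly."\n'
    "Sentiment:"
)
# The string is sent to the model as-is; its completion is the prediction.
```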

The study reveals impressive advancements in assessing translation quality using GPT-based models. The researchers showcase the models' state-of-the-art capabilities by evaluating translations on the latest WMT22 metrics evaluation data, focusing on system-level assessments.

The experiments are conducted with four different prompt templates, and the researchers find that the template with the fewest constraints yields the best performance in evaluating translation quality.
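For intuition only, the difference might look like the two illustrative templates below; these are hedged paraphrases of the general idea, not the study's actual templates.

```python
# Illustrative contrast (not the paper's templates): a heavily constrained
# prompt that forces a fixed answer format versus a minimally constrained one.
constrained_template = (
    "Rate the translation quality using exactly one of these labels: "
    "No meaning preserved / Some meaning preserved / Most meaning preserved / Perfect.\n"
    "Source: {source}\nTranslation: {hypothesis}\nLabel:"
)

minimal_template = (
    "Score the following translation from {src_lang} to {tgt_lang} on a scale "
    "from 0 to 100.\n"
    "Source: {source}\nTranslation: {hypothesis}\nScore:"
)
```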

Furthermore, the researchers examine seven different GPT models and discover that only GPT-3.5 and larger models demonstrate the ability to accurately assess translation quality.

It is important to note, however, that the metric used in the study, GEMBA, is not yet reliable enough for evaluating translations at the segment level. Therefore, it is recommended to utilize GEMBA primarily for system-level evaluation purposes.

Future research endeavors aim to delve into the application of GPT models for quality assessment. The researchers intend to explore the transition from zero-shot to few-shot methodology, as it holds promise for enhancing the accuracy of GEMBA.

Few-shot prompting refers to a technique used in natural language processing (NLP), particularly with models like GPT (Generative Pre-trained Transformer). In traditional machine learning, models are typically trained on large datasets with vast amounts of labeled examples. Few-shot prompting takes a different approach: the model is shown only a small number of labeled examples, allowing it to generalize and perform well on new, unseen inputs with very little task-specific data.

In the context of GPT models, few-shot prompting means including a small number of labeled examples, or "shots," directly in the prompt. Instead of fine-tuning or retraining the model, which would require a large amount of labeled data, the model is adapted to a new task or domain by prepending a handful of worked examples to the input. These examples guide the model's behavior and enable it to generalize to similar, unseen instances without any weight updates.

By using few-shot prompting, models can learn to perform new tasks or provide accurate responses with only a handful of labeled examples, reducing the dependency on extensive training data. This approach is particularly useful in scenarios where gathering a large labeled dataset is challenging or time-consuming, allowing for more efficient and practical application of machine learning models.
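Applied to translation quality scoring, a few-shot prompt might look like the sketch below; the example sentences and scores are invented for illustration, and the format the researchers will ultimately adopt is not specified in the article.

```python
# Few-shot (in-context) prompt sketch: a few scored examples precede the new
# input, steering the model without any weight updates. The examples and
# scores are invented for illustration only.
few_shot_examples = [
    ("Der Vertrag tritt morgen in Kraft.",
     "The contract comes into force tomorrow.", 95),
    ("Der Vertrag tritt morgen in Kraft.",
     "The contract steps into power tomorrow.", 40),
]


def build_few_shot_prompt(source, hypothesis):
    """Prepend scored examples to the new segment that should be judged."""
    parts = ["Score each German-to-English translation from 0 to 100.\n"]
    for src, hyp, score in few_shot_examples:
        parts.append(f"Source: {src}\nTranslation: {hyp}\nScore: {score}\n")
    parts.append(f"Source: {source}\nTranslation: {hypothesis}\nScore:")
    return "\n".join(parts)


print(build_few_shot_prompt(
    "Die Lieferung verzögert sich um eine Woche.",
    "The delivery is delayed by one week.",
))
```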

They also plan to investigate model fine-tuning to further refine the assessment outcomes. Another avenue of exploration involves adapting prompts to support MQM error-based evaluation or post-editing efforts, which could lead to additional improvements.

What does the future look like for Translation QA?

The utilization of GPT-enhanced evaluation metrics holds the potential for significant advancements in document-level evaluation. The researchers highlight the advantage of larger context windows offered by GPT models, which can be beneficial in this context. It is important to note that the current body of research on document-level metrics is limited, making it an intriguing and promising area for future exploration.

While the preliminary results suggest that the GEMBA metric performs strongly compared to other automated metrics evaluated in the WMT22 Metrics shared task, it is important to consider certain limitations. These results are based on human labels for only three language pairs, and the metric's performance may be weaker for other language pairs, particularly under-resourced ones, as demonstrated by Hendy et al. (2023), who highlight lower translation quality for such languages.

Moreover, it should be noted that while GEMBA achieves state-of-the-art performance on a system level, there is still room for improvement at the segment level. The reported results provide an indication of the potential performance that Large Language Models (LLMs) could eventually achieve for the task of translation quality assessment.

While humans tend to be better at evaluating the nuances and subtleties of translation quality, and large language models offer speed, consistency, and scalability, a combination of human expertise and large language models could deliver the best results for improving translation quality in the future.

Bernhard Sulzer, MA

Author / German Instructor / Translator (Law, Business/Marketing, Books). Helping you communicate effectively in German and English. Ideally positioned for English and German translations. Since 1998. +1 419 320 7745

6 months

"A study conducted by ?????????????????? ?????????????????? ... LLMs offer lightning-fast processing and consistent evaluations, devoid of human biases or fatigue...?????? ?????????????? ?????? ????????????????????! The study showcases GPT-based models achieving state-of-the-art performance on the MQM 2022 test set for English-German..." Impressive? You expect anything else from Microsoft? Questions: a) How is AI devoid of human biases? Did they define bias? AI is programmed and trained by humans. Could Microsoft and the Internet be biased raving about AI? b) GEMBA - a program - evaluates translations? How? How does a program that clearly isn't sentient evaluate anything? It just computes based on data fed into it. c) About the MQM 2022 test set. How is it state-of-the-art performance? Evaluated on which basis? Without human input?! Some thoughts from the article below: "Compared to automatic quality evaluation, human quality evaluation does not depend on text formatting details or the dataset per se, and correctly captures the human perception of quality....In contrast to human evaluation, automatic evaluation of quality quickly produces the same scores if repeated a number of times, which creates an illusion of precision."

Lee Densmer

I build and manage B2B content marketing programs that drive growth / Specializing in the language services industry

1 year

It's an interesting call-out that humans have biases (by now we all know this) and fatigue (every one of us loses steam after doing one task for a while). Computers don't get worn out.

Superhuman translation has been a reality for a long, long time; it depends on how you measure it and conclude that it is superhuman. If, for example, you focus on speed, machine translation was already faster than humans for gisting 50+ years ago, when we started translating Russian to English to assist hundreds of human translators with the Air Force. It was not a complete victory for machine over human, though, because it was sometimes off topic when translating something ambiguous or never seen before.

Boryana Nenova

Making things work

1 year

Automated content creation and verification, then translation, PE, LQA > auto-summarize and auto-read or reply. Then rinse and repeat? I'm seriously considering upskilling in farming...
