Translation Quality Evaluation Is All We Need
“The unpredictable abilities emerging from large AI models: Large language models like ChatGPT are now big enough that they’ve started to display startling, unpredictable behaviors.”
“Recent investigations have revealed that LLMs can produce hundreds of ‘emergent’ abilities — tasks that big models can complete that smaller models can’t.”
“New analyses suggest that for some tasks and some models, there’s a threshold of complexity beyond which the functionality of the model skyrockets.”
These March headlines (along with hundreds of others) took the AI hype to new heights. In AI/ML circles, all anyone would talk about was Artificial General Intelligence arriving literally tomorrow.
Lists of emergent abilities were compiled. Some people grew extremely excited, while others grew fearful. Petitions were drafted imploring the authorities to halt the training of neural models in order to prevent what was seen as inevitable harm. The clamor just wouldn’t cease, and it went on for several months.
It shouldn’t come as much of a surprise, then, that last week’s release of a paper entitled “Are Emergent Abilities of Large Language Models a Mirage?” (https://arxiv.org/pdf/2304.15004.pdf) by Stanford University researchers was greeted with far less attention and fanfare. The question in the title is, of course, a very polite way of decisively stating that the aforementioned emergent abilities really are a mirage. They do not exist.
In the dry language of academic papers, the article’s authors wrote: “Emergent abilities appear due to the researcher's choice of metric rather than due to fundamental changes in model behavior with scale. Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics produce smooth, continuous predictable changes in model performance”.
Oops…
“Changing from a nonlinear metric to a linear metric such as Token Edit Distance, scaling shows smooth, continuous and predictable improvements, ablating the emergent ability.”
Well, there it is. It seems the first thing ML/NLP researchers should be taught is to verify their metrics and benchmarks before coming out with hype-inducing announcements. Listening to professional linguists and practitioners of the translation craft would be another excellent piece of advice to follow.
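To see the paper’s point concretely, here is a toy numerical sketch (our own illustration with made-up numbers, not code from the paper): if per-token accuracy improves smoothly with model scale, a nonlinear metric such as exact match still appears to “jump” suddenly, while a linear metric such as token edit distance improves gradually.

```python
# Toy illustration (hypothetical numbers): smooth per-token gains look
# "emergent" under a nonlinear metric but gradual under a linear one.
import numpy as np

seq_len = 10                                   # tokens per target answer
model_sizes = np.logspace(8, 11, 8)            # hypothetical parameter counts
per_token_acc = np.linspace(0.50, 0.99, 8)     # smooth, made-up scaling curve

for size, p in zip(model_sizes, per_token_acc):
    exact_match = p ** seq_len                 # nonlinear: all tokens must be right
    expected_edits = seq_len * (1 - p)         # linear: expected token edits needed
    print(f"{size:9.1e} params | token acc {p:.2f} | "
          f"exact match {exact_match:.3f} | expected edits {expected_edits:.2f}")
```

Nothing about the underlying model changes between the two columns; only the metric does.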
Ever since the multi-head attention mechanism was invented, we have witnessed a constant pattern of AI/ML hype. If it had remained solely a media phenomenon, that would be one thing. Unfortunately, this hype interferes with the smooth production processes established in the localization industry, confusing service suppliers and customers alike.
If you have ever studied a crab closely, you will have noticed that one of its pincers is often much larger than the other: because the crab constantly acts with its dominant claw, the other one withers. Today, machine translation and NLP look very much like such a crab: almost anybody can take some data, feed it into some model, and get some sort of result. But when it comes to evaluating those results and measuring their quality, almost nobody seems to have that ability.
It’s time to learn how to measure the quality of MT output correctly!
The “classic” BLEU and ROUGE metrics are actually similarity metrics, not “quality” metrics, and our recent studies have demonstrated that a metric such as COMET is completely defunct in its standard implementation. To overcome these shortcomings, we have created our proprietary but publicly available LOGIPEM/HOPE metric (https://arxiv.org/pdf/2112.13833.pdf), which can be used to measure the quality of MT output and requires only one hour of a professional human linguist’s time. The evaluation can be done with our Perfectionist tool (https://perfectionist.app).
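To make the “similarity, not quality” point concrete, here is a minimal sketch using the sacrebleu package (the example sentences are our own, not data from any study): a perfectly adequate paraphrase scores poorly against a single reference simply because it shares few n-grams with it, while a disfluent near-copy scores high.

```python
# Illustrative only: BLEU rewards n-gram overlap with the reference,
# so surface similarity is scored, not translation quality.
import sacrebleu

reference = ["The patient should take the medication twice a day."]

candidates = {
    "adequate paraphrase": "Patients need to take this medicine two times daily.",
    "disfluent near-copy": "The patient should take take the medication twice a day.",
}

for label, hypothesis in candidates.items():
    result = sacrebleu.sentence_bleu(hypothesis, reference)
    print(f"{label:20s} BLEU = {result.score:.1f}")
```

A human linguist would rank these two outputs the other way around, which is exactly why similarity scores alone cannot stand in for quality evaluation.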
Truth be told, the localization industry really shouldn’t skimp on that one hour: different MT outputs should be compared before choosing an engine or making other important decisions, such as whether custom training of an LLM is economically feasible.
Proper LQA/TQE (language quality assurance / translation quality evaluation) is really all we need to verify the claims of any upcoming breakthrough.
And yes, the room-temperature superconductor does not exist either; it was yet another unverified claim. Sorry to burst your bubble, but wouldn’t it be nice if claims were verified and tested by professionals before being announced? That’s what we are doing at our R&D Lab.
Follow our newsletter and research to save time and find efficiencies that work!
We take inspiration from real-life use cases and always test the facts. You are welcome to share your use case and data to participate in our R&D effort. Book a conference call to discuss R&D collaboration.
Comment from a Localization Program Manager | Localization Technology Expert (1 year ago): People often forget that just because something is said on TV doesn't make it true. :) With technology, you must always test it to see how it actually works.