Terminology extraction as a tool for MT output assessment and improvement

Abstract

This paper describes the benefits of embedding semantic term extraction directly into the translation workflow system of a language services department, that is, the department typically in charge of translating documents within an organization. It proposes a terminology recall index (TRI) computed from the frequency and stemming information of noun groups only. The paper aims to a) demonstrate the utility of comparing the TRI of a human translation with that of a neural machine translation; and b) show that the TRI has many other applications within a translation workflow system.


1          Introduction

The metrics most often used to evaluate the similarity between two documents in the same language are based on some form of edit distance, which represents the effort required to make the two documents identical. The bilingual evaluation understudy (BLEU) score, strictly speaking, measures n-gram overlap with a reference translation rather than edit distance, but like edit distance it gives every word the same weight. This paper will first demonstrate, from a linguistic point of view and as explained in Section 2, why such metrics are problematic.
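To make the notion of edit distance concrete, the following minimal sketch computes a word-level Levenshtein distance. The function name and the example sentences are illustrative assumptions, not part of any existing tool. It shows that replacing an article and replacing a word inside a noun group cost exactly the same.

```python
def word_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed over whitespace-separated words."""
    x, y = a.split(), b.split()
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, wx in enumerate(x, start=1):
        curr = [i]
        for j, wy in enumerate(y, start=1):
            cost = 0 if wx == wy else 1
            curr.append(min(prev[j] + 1,          # delete a word
                            curr[j - 1] + 1,      # insert a word
                            prev[j - 1] + cost))  # substitute a word
        prev = curr
    return prev[-1]


# Illustrative sentences (assumed for this example only)
reference = "the committee approved the risk assessment report"
article_changed = "a committee approved the risk assessment report"
term_changed = "the committee approved the risk evaluation report"

print(word_edit_distance(reference, article_changed))
print(word_edit_distance(reference, term_changed))
```

Both calls return the same distance, although only the second change alters a term that a linguist would need to verify.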

It will then propose a more accurate metric, the terminology recall index (TRI), with which linguists can evaluate the difference between two target versions of a given document using semantic term extraction algorithms. (Linguists are employees in language services departments; their roles include translator, post-editor, editor, reviser, coordinator, proofreader, and terminologist.) It will also propose that a TRI report be generated automatically as part of the translation workflow process.

This paper will cover in detail the TRI applications identified to date that could be integrated into a translation workflow with the primary goal of automating the terminology quality assurance (QA) process while improving productivity.

In conclusion, the paper will describe the currently identified limitations of this approach and explore possible avenues for research and development in order to remedy these limitations.

2          Context and assumptions

During the translation phase, searching tasks take between 30 and 40 percent of linguists’ time, as explained under Subsection 2.1. (A translation request can go through different processes, such as pretranslation, post-editing, editing, proofreading, bilingual revision, typesetting, desktop publishing, and bitext creation. Requesters submit their requests either via a translation workflow platform or by email. Depending on the nature and complexity of the content, searching time can be as much as 50 percent or close to nothing.) The terms that linguists search for are mainly noun groups, and the terminology databases maintained internally by language services are also mainly composed of noun groups, as explained under Subsection 2.2.

2.1        More than one day per week spent searching

For quality and consistency purposes, linguists often search for terms in many different sources. Depending on his or her role, a linguist may use the following sources:

·     Glossaries and dictionaries, for finding definitions, synonyms, etymology, etc.

·     Internal and/or online terminology databases, for finding specific terms for a specific domain or client

·     Bitext search engines, for finding terms and their translations in context

·     Translation memories, for finding bilingual concordances of a term

·     Full-text search engines, to see how a term is used in the original language

·     Full-text search engines, for finding previously translated documents or reference materials

In all these cases except the last one, the part of speech most often searched for is the noun group. (The reference searching task in the final bullet point is usually done by coordinators and does not require any linguistic skills.) The noun group is also the most time-consuming part of speech to process. Although other parts of speech are also searched for, they do not take up as much time as noun groups and, with the exception of idiomatic expressions, are not challenging for linguists. In some language services departments, if a search yields a satisfactory result, linguists will update the internal terminology database (TB), whether upon instruction or on their own initiative.

2.2        Noun groups in terminology databases

Most internal language services departments in organizations maintain an internal TB. This database mainly comprises records for noun groups. The same linguists who search within this TB also create and maintain term records, as described above under Subsection 2.1.

2.3        A more accurate metric

Based on the observations in Subsections 2.1 and 2.2 above, and given that the BLEU score treats all parts of speech equally, whether low-weight semantic entities (articles, conjunctions, prepositions, etc.) or high-weight semantic expressions (noun groups, etc.), it is reasonable to assume that a metric based on noun groups only would be more accurate and more useful to linguists.

3          Term extraction for noun groups

This section describes how a semantic term extraction engine, required for calculating the TRI, has been developed by combining existing term extraction algorithms with existing part-of-speech tagging algorithms.

3.1        Existing term extraction engines

Most existing term extraction engines are based on a statistical model and use lists of “stop words” for noise reduction. They are somewhat useful for identifying patterns of co-occurring words, sorted by frequency. However, beyond basic stop-word noise reduction, statistical term extraction remains “noisy” and cannot generate high-quality noun group extractions. Some term extraction tools provide a user interface for removing noise entries manually.
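As an illustration of this kind of statistical extraction, here is a minimal sketch that counts two- and three-word phrases and discards those that begin or end with a stop word. The stop-word list, the function name and the sample text are assumptions made for this example; real engines use far larger lists and more elaborate statistics.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a real engine's list)
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "to", "in", "is", "are", "for", "by"}


def candidate_phrases(text: str, max_len: int = 3) -> Counter:
    """Count phrases of 2..max_len words that neither start nor end with a stop word."""
    counts = Counter()
    # Split on punctuation so that phrases never span two sentences or clauses
    for segment in re.split(r"[.;:!?,]", text.lower()):
        tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", segment)
        for n in range(2, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                    continue
                counts[" ".join(gram)] += 1
    return counts


sample = ("The translation memory stores segments. "
          "The translation memory is updated weekly. "
          "Linguists search the terminology database.")
counts = candidate_phrases(sample)
print(counts.most_common(5))
```

On this sample, "translation memory" is the most frequent phrase, but entries such as "memory stores" and "updated weekly" also survive, which is the kind of noise that stop-word filtering alone cannot remove.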

3.2        Existing tagger engines

Existing taggers are able, among other things, to a) identify parts of speech and modifiers for each word in a given sentence; and b) build a grammar tree for the sentence. Since taggers are more language-sensitive than term extractors, they support fewer languages. Although grammatical information is usually extracted well, taggers alone do not deliver cleanly extracted “noun groups” that linguists can use directly.

3.3        Integrating taggers into term extractors

By “plugging” the multiword phrases produced by a regular statistical term extraction into a tagger engine, a “semantic term extraction” engine can be developed and integrated. The first, fast statistical pass reduces basic stop-word noise and retains multiword phrases only. During the second pass, a tagger identifies part-of-speech patterns, which makes it possible to mark an extracted phrase as a potential noun group.

For basic disambiguation purposes, only multiword phrases are retained, because lemmatized single words are highly subject to polysemy.
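The second pass can be sketched as a part-of-speech pattern filter. In this minimal example, a small hand-written tag lookup (TOY_TAGS, a hypothetical stand-in) replaces a real tagger, and the pattern "any number of adjectives or nouns followed by a noun" is an assumed, simplified definition of an English noun group. A real tagger would tag words in context rather than from a fixed table.

```python
import re

# Hypothetical tag lookup standing in for a real part-of-speech tagger
TOY_TAGS = {
    "translation": "NOUN", "memory": "NOUN", "terminology": "NOUN",
    "database": "NOUN", "linguists": "NOUN", "internal": "ADJ",
    "stores": "VERB", "search": "VERB", "updated": "VERB", "weekly": "ADV",
}

# Assumed noun group pattern: zero or more ADJ/NOUN modifiers, then a head NOUN
NOUN_GROUP = re.compile(r"(?:(?:ADJ|NOUN) )*NOUN")


def is_noun_group(phrase: str) -> bool:
    """Return True for phrases of two or more words whose tags fit the pattern."""
    tags = [TOY_TAGS.get(word, "X") for word in phrase.split()]
    return len(tags) >= 2 and NOUN_GROUP.fullmatch(" ".join(tags)) is not None


# Multiword phrases as a statistical first pass might return them (illustrative)
candidates = ["translation memory", "memory stores",
              "internal terminology database", "updated weekly", "linguists search"]
noun_groups = [phrase for phrase in candidates if is_noun_group(phrase)]
print(noun_groups)
```

Only "translation memory" and "internal terminology database" pass the filter; single words are rejected by design, in line with the multiword restriction described above.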

4          TRI applications

This section lists possible TRI integrations and their respective applications.

4.1        TRI integration with MT

Here are some examples of integrating the TRI with MT engines: 1) Calculate the TRI between a human translation and a machine translation of the same document, allowing linguists to obtain a general TRI and identify which noun groups were and were not properly translated by the MT engine. 2) Use the TRI for analysis purposes when running bench tests or when comparing MT engines. 3) Employ the TRI to identify which MT engine should be used for a given source document by sending only the most frequent noun groups to the MT engines and cross-checking the output against an internal terminology database, or against the terms extracted from the human translation.
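No formula for the TRI is fixed in this paper. As a minimal sketch, assume it is the frequency-weighted share of the noun groups extracted from the human translation that are also found in the machine translation. The function name, this definition and the sample counts below are all assumptions made for illustration.

```python
from collections import Counter


def terminology_recall(reference: Counter, candidate: Counter):
    """Return (score, found, missing) for reference noun groups in the candidate.

    Assumed definition: occurrences matched in the candidate (clipped to the
    reference frequency) divided by total occurrences in the reference.
    """
    total = sum(reference.values())
    if total == 0:
        return 0.0, [], []
    matched = sum(min(n, candidate[term]) for term, n in reference.items())
    found = sorted(term for term in reference if candidate[term] > 0)
    missing = sorted(term for term in reference if candidate[term] == 0)
    return matched / total, found, missing


# Illustrative noun group frequencies for the two target versions
human = Counter({"translation memory": 3, "terminology database": 2, "term record": 1})
machine = Counter({"translation memory": 3, "terminology bank": 2, "term record": 1})

score, found, missing = terminology_recall(human, machine)
print(round(score, 2), found, missing)
```

The list of missing noun groups is what a linguist would review first, since it points to terms the MT engine rendered differently from the human translation.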

4.2        TRI integration with source documents

An example of a TRI application for documents prior to translation is to calculate the TRI by cross-checking the extracted noun groups against existing terminology, in order to identify which terms are and are not present in the internal terminology database. The list of known terms present both in the document and in the terminology database is called a “job glossary” extraction, while the list of unknown terms in the same document is called a “unilingual term candidate” extraction.
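The two extractions can be sketched as a simple split of the extracted noun groups against the terminology database. The dictionary layout (source term mapped to approved target term), the function name and the sample entries are assumptions for illustration; a production system would presumably match on stems rather than exact strings.

```python
def split_against_term_base(extracted_terms, term_base):
    """Split extracted noun groups into a job glossary and unilingual term candidates."""
    job_glossary = {t: term_base[t] for t in extracted_terms if t in term_base}
    term_candidates = [t for t in extracted_terms if t not in term_base]
    return job_glossary, term_candidates


# Illustrative English-to-French terminology database (assumed content)
term_base = {
    "translation memory": "mémoire de traduction",
    "terminology database": "base de données terminologiques",
}
extracted = ["translation memory", "terminology database", "noun group extraction"]

job_glossary, term_candidates = split_against_term_base(extracted, term_base)
print(job_glossary)
print(term_candidates)
```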

4.3        TRI integration with target documents

Generating a TRI that compares the final target document with internal terminology, or with a job glossary initially generated from the source document, can provide linguists with the following: 1) A list of terms present but not properly translated; 2) A list of terms not present at all; 3) A TRI for automatic workflow notifications, or for automatically feeding a database of term candidates. Linguists in charge of terminology could take a few moments to review the weekly TRI reports, which would become a tool for monitoring the quality of terminology.
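As a rough sketch of such a check, the code below looks for the approved target term of each job glossary entry in the final target text. The function name and the sample data are assumptions, and the simplification is deliberate: a flagged entry may be either mistranslated or absent, and telling the two apart would require an alignment with the source document.

```python
import re


def check_target_terms(job_glossary, target_text):
    """Return (respected, flagged) source terms based on approved target terms."""
    text = target_text.lower()
    respected, flagged = [], []
    for source_term, approved_target in job_glossary.items():
        # Whole-word match of the approved target term in the target text
        pattern = r"\b" + re.escape(approved_target.lower()) + r"\b"
        (respected if re.search(pattern, text) else flagged).append(source_term)
    return respected, flagged


# Illustrative job glossary and target text (assumed content)
glossary = {
    "translation memory": "mémoire de traduction",
    "terminology database": "base de données terminologiques",
}
target = ("La mémoire de traduction est mise à jour chaque semaine. "
          "Les linguistes consultent la banque de terminologie.")

respected, flagged = check_target_terms(glossary, target)
print(respected, flagged)
```

The flagged list could feed the workflow notification or the term candidate database mentioned above.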

5          Limitations

Two limitations for deploying the TRI have been identified: 1) since the TRI retains noun groups of two or more words only, for co-occurrence semantic reasons, it is less useful in subject matter fields in which single-word noun groups are more frequent; 2) taggers for languages other than English, French and German are more difficult to find and/or integrate into code.

6          Future developments in terminology automation

Apart from addressing the two limitations mentioned in Section 5 above, for which remedies should be achievable in the very near future, bilingual noun group extraction and/or translation spotting could be integrated to automatically suggest translations for the unilingual term candidates list, which would be generated upon the initial receipt of a document for translation. The resulting list of bilingual term candidates could also be added automatically to a terminology database with the same name. A quick review and validation/rejection task could then be performed on a weekly basis.

