Converting Paragraph-Based TMX to Sentence-Based Segmentation

Víctor Parra

Head of Localization Engineering @ LanguageWire

Published: October 24, 2023

Resegmenting a previously segmented translation memory using a different criterion is not an easy task, nor is it 100% reliable. It requires checks to ensure that we achieve our objective, and if the resegmentation is not fully applied, some manual work may be required. The first step is to open our translation memory in a text editor (such as Notepad++) and verify that it is indeed segmented by paragraph by examining the "segtype" attribute of the header element and checking some of the translation units stored in the memory:
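The same check can be scripted. The following is a minimal sketch using only the Python standard library; the sample TMX content and the function names are made up for illustration, and a real memory would be read from disk instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample memory; a real TMX file would be read from disk.
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="ExampleTool" creationtoolversion="1.0"
          segtype="paragraph" o-tmf="unknown" adminlang="en-US"
          srclang="en-US" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en-US"><seg>First sentence. Second sentence.</seg></tuv>
      <tuv xml:lang="es-ES"><seg>Primera frase. Segunda frase.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_segtype(tmx_text):
    """Return the value of the segtype attribute declared in the TMX header."""
    root = ET.fromstring(tmx_text.encode("utf-8"))
    return root.find("header").get("segtype")


def sample_segments(tmx_text, limit=3):
    """Return the plain text of the first few <seg> elements for a visual check."""
    root = ET.fromstring(tmx_text.encode("utf-8"))
    return ["".join(seg.itertext()) for seg in root.iter("seg")][:limit]


print(read_segtype(SAMPLE_TMX))
for text in sample_segments(SAMPLE_TMX):
    print(text)
```

A header that declares "paragraph" together with segments that contain several sentences each confirms that the memory needs resegmenting.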

We can also double-check the segmentation by opening our translation memory in a TMX editor (such as Heartsome TMX Editor):

Paragraph-based segmentation in Heartsome TMX Editor

Method 1: Alignment

Most CAT (Computer-Assisted Translation) tools come equipped with a semi-automatic alignment tool that allows us to create translation memories from multilingual corpora to leverage previous work for future tasks. For this demonstration, I will be using the alignment tool provided by OmegaT, but you can use any other tool such as memoQ, Matecat, Okapi Framework, Trados, Wordfast Autoaligner, among others.

The first step is to clean our raw translation memory, which means obtaining one document with only the source language segments and another with only the target language segments. To do this, we will use a text editor and a pair of regular expressions (regex). The regex will depend on the structure of our memory.

First, open the translation memory in Notepad++ and adjust the regex to extract only the source language segments:

Copy the marked text and paste it into a new file:

Next, we remove all unnecessary XML tags to leave our file with only the text of the translation unit and its inline tags. We can do this by performing a Search and Replace operation using regex again:

Finally, we do the same with the target language translation units and obtain two documents like this:

Source and target text extracted and cleaned up side by side

This procedure can be long and tedious, but it is the way to ensure that the tags in our translation memory are preserved after alignment, which is essential for optimal leverage.
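As an alternative to the manual regex steps, the extraction can be sketched in Python. This is not the procedure shown above but a scripted equivalent under stated assumptions: the sample TMX, the language codes, and the helper names are hypothetical, and the memory is assumed to mark languages with xml:lang (older TMX versions use a plain lang attribute instead).

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Qualified name that ElementTree uses for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample memory with inline tags inside the segments.
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="ExampleTool" creationtoolversion="1.0"
          segtype="paragraph" o-tmf="unknown" adminlang="en-US"
          srclang="en-US" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en-US"><seg>Click <bpt i="1">&lt;b&gt;</bpt>Save<ept i="1">&lt;/b&gt;</ept>. Then close.</seg></tuv>
      <tuv xml:lang="es-ES"><seg>Haga clic en <bpt i="1">&lt;b&gt;</bpt>Guardar<ept i="1">&lt;/b&gt;</ept>. Luego cierre.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def inner_xml(seg):
    """Serialize the content of <seg>, keeping inline tags but not <seg> itself."""
    parts = [escape(seg.text or "")]
    for child in seg:
        # ET.tostring also emits the text that follows the child (its tail).
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


def extract_language(tmx_text, lang):
    """Return one line per translation unit for the given language code."""
    root = ET.fromstring(tmx_text.encode("utf-8"))
    lines = []
    for tu in root.iter("tu"):
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if tuv.get(XML_LANG) == lang and seg is not None:
                lines.append(inner_xml(seg))
    return lines


source_lines = extract_language(SAMPLE_TMX, "en-US")
target_lines = extract_language(SAMPLE_TMX, "es-ES")
print("\n".join(source_lines))
print("\n".join(target_lines))
```

Writing source_lines and target_lines to two separate files gives the pair of documents that the aligner expects, with the inline tags still in place.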

Now let's get into action. We open our aligner (as I mentioned, I will use the one from OmegaT, but you can use the one that suits you best). The most important part of this step is to adjust the segmentation rules of our CAT tool/aligner to obtain the desired segmentation:

We load the two documents into the tool and perform an auto-alignment. We then check the result and make the necessary adjustments to ensure that our memory is perfectly segmented and aligned:

And voilà! Our translation memory is now perfectly resegmented by sentence, whereas before it was segmented by paragraph:

Method 2: Forced Resegmentation with SRX Rules

For the second method, we will use Rainbow, an application from the open-source localization engineering suite Okapi Framework. This method is much faster than the previous one because it eliminates the manual preprocessing of the raw translation memory. The procedure is straightforward and guided: load the TMX file with the appropriate filter and adjust it to our needs; convert the memory into an XLIFF file whose source and target are resegmented by sentence according to SRX rules; and then convert the resulting XLIFF into a new TMX file with the new segmentation.

Let's begin. Open Rainbow and add the translation memory in the Input 1 tab. Select the appropriate filter (TMX, in this case):

We proceed to create the XLIFF file from our translation memory through the following menu path:

When entering the translation package creation process, we encounter a screen to configure the new package we are going to create. First, we set how we want the segmentation to be. This is where we configure the resegmentation of the file by choosing SRX segmentation rules and forcing the resegmentation for both the source and target:

Please note that, at this stage, the most important aspect is defining the segmentation rules. They need to be clear and concise so that the source and the target are segmented in the same way. If you're unsure, you can click on the "Edit..." button once your rules are loaded, and it will open Ratel, an SRX editor, with your rules already loaded and a real-time preview of the result:
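To illustrate how such rules behave, here is a minimal Python sketch that imitates the SRX idea of ordered break and no-break rules, each made of a "before break" and an "after break" pattern. It is not Okapi's segmentation engine, and the rules, sample sentences, and function name are assumptions for illustration only.

```python
import re

# Each rule: (is_break, before_break_regex, after_break_regex).
# As in SRX, rules are checked in order and the first match decides.
RULES = [
    (False, r"\b(?:Mr|Mrs|Dr|Sr|Sra|etc)\.", r"\s"),  # exception: abbreviations
    (True, r"[.!?]", r"\s+"),                         # break after end punctuation
]


def segment(text, rules=RULES):
    """Split text into sentences using ordered break/no-break rules."""
    compiled = [
        (is_break, re.compile("(?:" + before + r")\Z"), re.compile(after))
        for is_break, before, after in rules
    ]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            # A rule applies when the text up to pos ends with its "before"
            # pattern and the text from pos onwards starts with its "after" pattern.
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # first matching rule wins
    segments.append(text[start:])
    return [s.strip() for s in segments if s.strip()]


source = segment("Mr. López arrived. He sat down.")
target = segment("El Sr. López llegó. Se sentó.")
print(source)
print(target)
print("same number of segments:", len(source) == len(target))
```

Comparing the number of segments on both sides, as in the last line, is a quick way to spot units where the rules split the source and the target differently.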

Now, select the option and configure the output file (XLIFF in our case):

Upon executing the action, a package is created, and within it, we find a folder called "Work" where our XLIFF file is located. The XLIFF file contains the source text stored in our translation memory, now segmented by sentence, as well as the segmented target text:

Open the resulting XLIFF file in a text editor and verify that the text has been segmented according to our preferences:

And indeed, it has been done. It is worth noting that the segmented text in an XLIFF 1.2 file is displayed as follows: for the source, it is enclosed within <seg-source> tags and delimited by <mrk> tags, while for the target, it is directly enclosed within <mrk> tags within <target> tags.
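The structure just described can also be read programmatically. The sketch below pairs source and target segments by their mid attribute using the Python standard library; the sample XLIFF content and the function name are hypothetical, and the file produced by Rainbow may carry additional attributes and elements.

```python
import xml.etree.ElementTree as ET

# Namespace of XLIFF 1.2 documents.
NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Hypothetical sample; a real file would be read from the package's Work folder.
SAMPLE_XLIFF = """<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="memory.tmx" source-language="en-US" target-language="es-ES" datatype="xml">
    <body>
      <trans-unit id="1">
        <source>First sentence. Second sentence.</source>
        <seg-source><mrk mid="0" mtype="seg">First sentence.</mrk> <mrk mid="1" mtype="seg">Second sentence.</mrk></seg-source>
        <target xml:lang="es-ES"><mrk mid="0" mtype="seg">Primera frase.</mrk> <mrk mid="1" mtype="seg">Segunda frase.</mrk></target>
      </trans-unit>
    </body>
  </file>
</xliff>
"""


def segment_pairs(xliff_text):
    """Return (source, target) text pairs, matched by the mid of each <mrk>."""
    root = ET.fromstring(xliff_text.encode("utf-8"))
    pairs = []
    for unit in root.iterfind(".//x:trans-unit", NS):
        source = {
            mrk.get("mid"): "".join(mrk.itertext())
            for mrk in unit.iterfind("x:seg-source/x:mrk[@mtype='seg']", NS)
        }
        target = {
            mrk.get("mid"): "".join(mrk.itertext())
            for mrk in unit.iterfind("x:target/x:mrk[@mtype='seg']", NS)
        }
        for mid, text in source.items():
            # A missing target segment shows up as None and deserves a manual check.
            pairs.append((text, target.get(mid)))
    return pairs


for src, tgt in segment_pairs(SAMPLE_XLIFF):
    print(src, "|", tgt)
```

Note that this sketch reads only the text of each segment; inline codes inside a segment would need to be serialized separately if they must be inspected as well.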

We can also open the XLIFF file in our CAT tool to verify the segmentation:

Sentence-based resegmented XLIFF in XLIFF editor (memoQ)

To conclude, we load the resulting XLIFF file back into Rainbow, in the Input 1 tab, and the XLIFF filter will be automatically selected:

Open the menu Utilities > Conversion Utilities > File Format Conversion...

Finally, in the configuration window that appears, select the following options and execute them:

A TMX file will be created alongside the converted XLIFF file, but it may contain some unnecessary information in its <prop> elements:

That can be cleaned up using the following regex in a Search and Replace operation:

Finally, manually change the "segtype" attribute in the header from "paragraph" to "sentence".
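Both cleanup steps can be scripted as well. The following Python sketch is one possible approach and does not reproduce the regex used in the article: the patterns, the sample content, and the function name are assumptions. It removes every <prop> element, so the pattern should be narrowed if some properties need to be kept.

```python
import re

# Assumed pattern: any <prop ...>...</prop> element plus the rest of its line.
PROP_RE = re.compile(r"[ \t]*<prop\b[^>]*>.*?</prop>[ \t]*\r?\n?", re.DOTALL)

# Assumed pattern: the segtype attribute inside the <header> element.
SEGTYPE_RE = re.compile(r'(<header\b[^>]*?\bsegtype=")paragraph(")')

# Hypothetical sample of the TMX produced by the conversion.
SAMPLE_TMX = """<tmx version="1.4">
<header creationtool="ExampleTool" segtype="paragraph" srclang="en-US"/>
<body>
<tu tuid="1">
<prop type="x-note">generated</prop>
<tuv xml:lang="en-US"><seg>Hello.</seg></tuv>
<tuv xml:lang="es-ES"><seg>Hola.</seg></tuv>
</tu>
</body>
</tmx>
"""


def finalize_tmx(tmx_text):
    """Drop <prop> elements and declare sentence segmentation in the header."""
    cleaned = PROP_RE.sub("", tmx_text)
    return SEGTYPE_RE.sub(r"\g<1>sentence\g<2>", cleaned, count=1)


print(finalize_tmx(SAMPLE_TMX))
```

It is worth opening the result in a TMX editor afterwards to confirm that the file is still well-formed.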

Now we will have our translation memory resegmented by sentence, while preserving our tags, fully optimized for the intended use:

Final sentence-based segmented TMX file in Heartsome TMX Editor

Sindhura BL

Looking for a full-time job as a Test Automation Engineer.

1 month ago

I have created segments using sentence type, now how to convert it to paragraph based segmentation.

Kevin Lossner

Independent Writer and Language Services Teacher

4 months ago

Food for thought here, thank you. Now I'll try to translate all that into a process with my usual toolset. It's been too long since I messed with the Rainbow suite.

Sergio Alasia

Perfecting brand communication across 15 languages, one word at a time | Language technology specialist | Entrepreneur

9 months ago

Hi Víctor, thank you for your contribution, it was very helpful. I'd like to suggest a slight improvement to the 2nd method, though. There is no need to convert into XLIFF and back to TMX. You can set up a simple custom pipeline like you see in the pic to process the TMX file in Rainbow. Afterwards you'll still need to delete the <prop> tags, but there are a lot fewer.

Anna Ward

Head of Localization @ Unite | Freelance Translator German to English | Freelance Proofreader & Copyeditor for English

12 months ago

We love your posts over at Unite's loc team! Alejandro, right?

Robert Lo Bue

Operations | BioTech

12 months ago

Excellent post Victor! Really helpful for a lot of people I am sure!
