Converting Paragraph-Based TMX to Sentence-Based Segmentation

Resegmenting a translation memory that was originally segmented under a different criterion is not an easy task, nor is it 100% reliable. We need to check the result at each stage, and if the resegmentation is not fully applied, some manual work may be required. The first step is to open our translation memory in a text editor (such as Notepad++) and verify that it is indeed segmented by paragraph by examining the "segtype" attribute in the header and checking some of the translation units stored in the memory:

Paragraph-based segmentation
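
The same check can also be scripted. The following is only a minimal sketch using Python's standard library; the sample TMX and the function names are made up for illustration, and a real memory would be read from disk (for example with ET.parse) instead of from a string:

```python
import xml.etree.ElementTree as ET

# Tiny illustrative TMX (hypothetical content, not the memory from the screenshots).
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="paragraph"
          o-tmf="example" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>First sentence. Second sentence.</seg></tuv>
      <tuv xml:lang="es"><seg>Primera frase. Segunda frase.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def header_segtype(tmx_text):
    """Return the value of the segtype attribute declared in the TMX header."""
    root = ET.fromstring(tmx_text)
    return root.find("header").get("segtype")


def sample_segments(tmx_text, limit=5):
    """Return the plain text of the first few <seg> elements for a visual check."""
    root = ET.fromstring(tmx_text)
    segments = []
    for seg in root.iter("seg"):
        segments.append("".join(seg.itertext()))
        if len(segments) == limit:
            break
    return segments


print(header_segtype(SAMPLE_TMX))
for text in sample_segments(SAMPLE_TMX):
    print(text)
```

If the header says "paragraph" and the sampled segments contain several sentences each, the memory is a candidate for resegmentation.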


We can also double-check the segmentation by opening our translation memory in a TMX editor (such as Heartsome TMX Editor):

Paragraph-based segmentation in Heartsome TMX Editor


Method 1: Alignment

Most CAT (Computer-Assisted Translation) tools come equipped with a semi-automatic alignment tool that allows us to create translation memories from multilingual corpora and leverage previous work in future tasks. For this demonstration, I will be using the alignment tool provided by OmegaT, but you can use any other tool, such as memoQ, Matecat, Okapi Framework, Trados, or Wordfast Autoaligner.

The first step is to clean our raw translation memory, which means obtaining one document with only the source language segments and another document with only the target language segments. To do this, we will use a text editor and a pair of regular expressions (regex). The exact regex will depend on the structure of our memory.

First, open the translation memory in Notepad++ and adjust the regex to extract only the source language segments:

Regex to extract source segments


Copy the marked text and paste it into a new file:

Resulting extraction of source segments


Next, we remove all unnecessary XML tags to leave our file with only the text of the translation unit and its inline tags. We can do this by performing a Search and Replace operation using regex again:

Clean extra XML tags


Finally, we do the same with the target language translation units and obtain two documents like this:

Source and target text extracted and cleaned up side by side
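
For those who prefer scripting to Search and Replace, the extraction can be approximated with an XML parser instead of regex. This is a different technique from the one shown in the screenshots, offered only as a sketch: the sample TMX, the language codes and the function names are assumptions for illustration. It writes nothing to disk; it simply returns two parallel lists, one paragraph per entry, with the inline tags kept as they appear inside each "seg" element:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample memory with inline tags.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="paragraph"
          o-tmf="example" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Click <bpt i="1">&lt;b&gt;</bpt>Save<ept i="1">&lt;/b&gt;</ept>. Then close the window.</seg></tuv>
      <tuv xml:lang="es"><seg>Haga clic en <bpt i="1">&lt;b&gt;</bpt>Guardar<ept i="1">&lt;/b&gt;</ept>. Luego cierre la ventana.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def inner_xml(seg):
    """Serialize the content of a <seg>, keeping inline tags such as <bpt>/<ept>."""
    parts = [escape(seg.text or "")]
    for child in seg:
        # tostring() includes the text that follows the child (its tail).
        parts.append(ET.tostring(child, encoding="unicode"))
    # One paragraph per line: line breaks inside a segment become spaces.
    return "".join(parts).replace("\n", " ")


def split_by_language(tmx_text, src_lang, tgt_lang):
    """Return two parallel lists with the source and target segments."""
    root = ET.fromstring(tmx_text)
    sources, targets = [], []
    for tu in root.iter("tu"):
        by_lang = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                by_lang[lang] = inner_xml(seg)
        # Keep only units that have both languages, so the lists stay parallel.
        if src_lang in by_lang and tgt_lang in by_lang:
            sources.append(by_lang[src_lang])
            targets.append(by_lang[tgt_lang])
    return sources, targets


sources, targets = split_by_language(SAMPLE_TMX, "en", "es")
print("\n".join(sources))
print("\n".join(targets))
```

Each list can then be saved as its own text file and loaded into the aligner.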


This procedure can be long and tedious, but it is the way to ensure that the tags in our translation memory are preserved after alignment, which is essential for optimal leverage.

Now let's get into action. We open our aligner (as I mentioned, I will use the one from OmegaT, but you can use the one that suits you best). The most important part of this step is to adjust the segmentation rules of our CAT tool/aligner to obtain the desired segmentation:

OmegaT Aligner


We load the two documents into the tool and perform an auto-alignment. We then check the result and make the necessary adjustments to ensure that our memory is perfectly segmented and aligned:

Segmentation rules for alignment


Alignment confirmation and TMX creation


And voilà! Our translation memory is now perfectly resegmented by sentence, whereas before it was segmented by paragraph:

Resulting TMX in Notepad++


Resulting TMX in Heartsome TMX Editor


Method 2: Forced Resegmentation with SRX Rules

For the second method, we will use Rainbow, an application from the open-source localization engineering suite Okapi Framework. This method is much faster than the previous one because it eliminates the manual preprocessing of the raw translation memory. The procedure is straightforward and guided: we load the TMX file with the appropriate filter and adjust it to our needs, convert the memory into an XLIFF file whose source and target are resegmented with SRX rules set to the desired final segmentation (by sentence), and then convert the resulting XLIFF into a new TMX file with the new segmentation.

Let's begin. Open Rainbow and add the translation memory in the Input 1 tab. Select the appropriate filter (TMX, in this case):

TMX loaded in Rainbow


We proceed to create the XLIFF file from our translation memory by following this path of options:

Creation of XLIFF file out of TMX


When entering the translation package creation process, we encounter a screen to configure the new package we are going to create. First, we set how we want the segmentation to be. This is where we configure the resegmentation of the file by choosing SRX segmentation rules and forcing the resegmentation for both the source and target:

Resegmentation options


Please note that, at this stage, the most important aspect is defining the segmentation rules. They need to be clear and concise so that the source and the target are segmented consistently. If you're unsure, you can click on the "Edit..." button once your rules are loaded; it will open Ratel, an SRX editor, with your rules already loaded and a real-time preview of their effect:

Ratel SRX rules editor
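
To make the idea of segmentation rules more concrete, here is a rough Python sketch of how SRX-style rules behave: each rule has a "before break" pattern, an "after break" pattern and a flag saying whether it is a break or an exception, and the first rule that matches at a given position wins. This is a simplification written for illustration, not Okapi's implementation; the two rules, the sample sentences and the function names are assumptions:

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    is_break: bool   # True = break here, False = exception (do not break)
    before: str      # regex that must match the text ending at the position
    after: str       # regex that must match the text starting at the position


# Illustrative rules only: exceptions must come before the general break rule.
RULES = [
    Rule(False, r"\b(?:Mr|Mrs|Dr|etc)\.", r"\s"),
    Rule(True, r"[.?!]+", r"\s+"),
]


def segment(text, rules):
    """Split text at every position where the first matching rule is a break."""
    compiled = [
        (r.is_break, re.compile("(?:" + r.before + r")\Z"), re.compile(r.after))
        for r in rules
    ]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # first matching rule decides, as in SRX
    segments.append(text[start:])
    # Trim the whitespace left at the segment boundaries.
    return [s.strip() for s in segments]


def same_segment_count(source, target, rules):
    """Quick sanity check that source and target split into the same number of segments."""
    return len(segment(source, rules)) == len(segment(target, rules))


print(segment("Dr. Smith arrived. He sat down.", RULES))
print(same_segment_count("Dr. Smith arrived. He sat down.",
                         "El Dr. Smith llegó. Se sentó.", RULES))
```

When the source and target of a paragraph split into a different number of sentences, that unit is one of the places where manual work is likely to be needed.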


Now, select the option and configure the output file (XLIFF in our case):

XLIFF creation


Upon executing the action, a package is created, and within it, we find a folder called "Work" where our XLIFF file is located. The XLIFF file contains the source text stored in our translation memory, now segmented by sentence, as well as the segmented target text:

Folder structure after XLIFF creation


Open the resulting XLIFF file in a text editor and verify that the text has been segmented according to our preferences:

Sentence-based resegmented XLIFF


And indeed, it has been. It is worth noting how segmented text is represented in an XLIFF 1.2 file: the segmented source is enclosed in a <seg-source> element, where each segment is delimited by <mrk mtype="seg"> tags, while the target segments are delimited by <mrk mtype="seg"> tags placed directly inside the <target> element.
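
As a rough illustration of that structure, the sketch below parses a hand-written XLIFF 1.2 fragment and pairs each source segment with its target segment through the "mid" attribute. The sample content and the function name are assumptions; the file produced by Rainbow will contain more attributes and metadata than this:

```python
import xml.etree.ElementTree as ET

# XLIFF 1.2 elements live in this namespace.
XLIFF_NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Hypothetical, hand-written sample of a segmented trans-unit.
SAMPLE_XLIFF = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="memory.tmx" source-language="en" target-language="es" datatype="xml">
    <body>
      <trans-unit id="1">
        <source>First sentence. Second sentence.</source>
        <seg-source><mrk mid="0" mtype="seg">First sentence.</mrk> <mrk mid="1" mtype="seg">Second sentence.</mrk></seg-source>
        <target><mrk mid="0" mtype="seg">Primera frase.</mrk> <mrk mid="1" mtype="seg">Segunda frase.</mrk></target>
      </trans-unit>
    </body>
  </file>
</xliff>"""


def segment_pairs(xliff_text):
    """Return (source, target) text pairs, matched by the mid attribute of <mrk>."""
    root = ET.fromstring(xliff_text)
    pairs = []
    for unit in root.iterfind(".//x:trans-unit", XLIFF_NS):
        source = {
            mrk.get("mid"): "".join(mrk.itertext())
            for mrk in unit.iterfind("x:seg-source/x:mrk[@mtype='seg']", XLIFF_NS)
        }
        target = {
            mrk.get("mid"): "".join(mrk.itertext())
            for mrk in unit.iterfind("x:target/x:mrk[@mtype='seg']", XLIFF_NS)
        }
        for mid, text in source.items():
            # A missing target segment shows up as None and deserves a manual look.
            pairs.append((text, target.get(mid)))
    return pairs


for src, tgt in segment_pairs(SAMPLE_XLIFF):
    print(src, "|", tgt)
```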

We can also open the XLIFF file in our CAT tool to verify the segmentation:

Sentence-based resegmented XLIFF in XLIFF editor (memoQ)


To conclude, we load the resulting XLIFF file back into Rainbow, in the Input 1 tab, and the XLIFF filter will be automatically selected:

XLIFF conversion


Open the menu Utilities > Conversion Utilities > File Format Conversion...

File format conversion


Finally, in the configuration window that appears, select the following options and execute them:

TMX creation out of XLIFF file


A TMX file will be created alongside the converted XLIFF file, but it may contain some unnecessary information in <prop> elements:

Resulting TMX with dirty content


That can be cleaned up using the following regex in a Search and Replace operation:

Cleaning up dirty TMX


Then manually change the "segtype" attribute in the header from "paragraph" to "sentence".
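
Both clean-up steps can also be scripted instead of done by hand. The following is a sketch under stated assumptions: it removes every <prop> element found directly under <tu> and <tuv> (which may be more than you want to discard, so check what your props contain first), leaves the header props alone, and sets "segtype" to "sentence". The sample TMX and the function name are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample resembling a converted memory with leftover <prop> elements.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="paragraph"
          o-tmf="example" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu tuid="1">
      <prop type="example-note">leftover metadata</prop>
      <tuv xml:lang="en"><prop type="example-note">more metadata</prop><seg>First sentence.</seg></tuv>
      <tuv xml:lang="es"><seg>Primera frase.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def clean_tmx(tmx_text):
    """Drop <prop> children of <tu>/<tuv> and declare sentence segmentation."""
    root = ET.fromstring(tmx_text)
    root.find("header").set("segtype", "sentence")
    for tu in root.iter("tu"):
        # Assumption: props at unit and variant level are all unwanted.
        for parent in [tu] + tu.findall("tuv"):
            for prop in parent.findall("prop"):
                parent.remove(prop)
    return ET.tostring(root, encoding="unicode")


cleaned = clean_tmx(SAMPLE_TMX)
print(cleaned)
```

Note that ET.tostring returns the document without an XML declaration; when saving to disk, ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True) adds one.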

Now we will have our translation memory resegmented by sentence, while preserving our tags, fully optimized for the intended use:

Final sentence-based segmented TMX file


Final sentence-based segmented TMX file in Heartsome TMX Editor





Sindhura BL

Looking for a full-time job as a Test Automation Engineer.

1 month ago

I have created segments using sentence type, now how to convert it to paragraph based segmentation.

Kevin Lossner

Independent Writer and Language Services Teacher

4 months ago

Food for thought here, thank you. Now I'll try to translate all that into a process with my usual toolset. It's been too long since I messed with the Rainbow suite.

Sergio Alasia

Perfecting brand communication across 15 languages, one word at a time | Language technology specialist | Entrepreneur

9 months ago

Hi Víctor, thank you for your contribution, it was very helpful. I'd like to suggest a slight improvement to the 2nd method, though. There is no need to convert into XLIFF and back to TMX. You can set up a simple custom pipeline like you see in the pic to process the TMX file in Rainbow. Afterwards you'll still need to delete the <prop> tags, but there are a lot fewer.

Anna Ward

Head of Localization @ Unite | Freelance Translator German to English | Freelance Proofreader & Copyeditor for English

12 months ago

We love your posts over at Unite's loc team! Alejandro, right?

Robert Lo Bue

Operations | BioTech

12 months ago

Excellent post Victor! Really helpful for a lot of people I am sure!
