Converting Paragraph-Based TMX to Sentence-Based Segmentation
Resegmenting a translation memory that was originally segmented under a different criterion is neither easy nor 100% reliable. The result has to be checked to confirm that we achieved our objective, and where the resegmentation is not fully applied, some manual work may be required. The first step is to open our translation memory in a text editor (such as Notepad++) and verify that it is indeed segmented by paragraph by examining the "segtype" attribute in the header and checking some of the translation units stored in the memory:
We can also double-check the segmentation by opening our translation memory in a TMX editor (such as Heartsome TMX Editor):
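The same check can also be scripted. The following is a minimal sketch in Python, assuming a well-formed TMX file; the sample memory and the function name read_segtype are made up for illustration and are not part of any tool mentioned here.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature TMX, used only for illustration.
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="demo" creationtoolversion="1.0" segtype="paragraph"
          o-tmf="demo" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>First sentence. Second sentence.</seg></tuv>
      <tuv xml:lang="es"><seg>Primera frase. Segunda frase.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_segtype(tmx_text):
    """Return the segtype declared in the TMX header, or None if absent."""
    # Encode to bytes so the XML declaration's encoding is honoured.
    root = ET.fromstring(tmx_text.encode("utf-8"))
    header = root.find("header")
    return header.get("segtype") if header is not None else None


def sample_segments(tmx_text, limit=5):
    """Return the plain text of the first few <seg> elements for a visual check."""
    root = ET.fromstring(tmx_text.encode("utf-8"))
    texts = ["".join(seg.itertext()) for seg in root.iter("seg")]
    return texts[:limit]


if __name__ == "__main__":
    print(read_segtype(SAMPLE_TMX))
    for text in sample_segments(SAMPLE_TMX):
        print(text)
```

If the header says "paragraph" and the sampled segments contain several sentences each, the memory is segmented by paragraph.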
Method 1: Alignment
Most CAT (Computer-Assisted Translation) tools come equipped with a semi-automatic alignment tool that allows us to create translation memories from multilingual corpora and leverage previous work in future tasks. For this demonstration, I will be using the alignment tool provided by OmegaT, but you can use any other tool, such as memoQ, Matecat, Okapi Framework, Trados, or Wordfast Autoaligner.
The first step is to clean our raw translation memory, which means obtaining one document with only the source language segments and another with only the target language segments. To do this, we will use a text editor and a pair of regular expressions (regex). The exact regex will depend on the structure of our memory.
First, open the translation memory in Notepad++ and adjust the regex to extract only the source language segments:
Copy the marked text and paste it into a new file:
Next, we remove all unnecessary XML tags to leave our file with only the text of the translation unit and its inline tags. We can do this by performing a Search and Replace operation using regex again:
Finally, we do the same with the target language translation units and obtain two documents like this:
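For readers who prefer a script to manual search and replace, the extraction can be sketched in Python as follows. The regular expression is an assumption written for a simple TMX layout (one <seg> per <tuv>, language given in xml:lang); it is not the regex used in the screenshots, and it must be adapted to the structure of your own memory. The sample data and the function name extract_segments are hypothetical.

```python
import re

# Hypothetical translation unit with inline tags, used only for illustration.
SAMPLE_TMX = """<tu>
  <tuv xml:lang="en"><seg>Click <bpt i="1">&lt;b&gt;</bpt>Save<ept i="1">&lt;/b&gt;</ept>. Then close the window.</seg></tuv>
  <tuv xml:lang="es"><seg>Haga clic en <bpt i="1">&lt;b&gt;</bpt>Guardar<ept i="1">&lt;/b&gt;</ept>. Luego cierre la ventana.</seg></tuv>
</tu>"""


def extract_segments(tmx_text, lang):
    """Return the content of every <seg> for one language, inline tags included."""
    pattern = re.compile(
        r'<tuv\b[^>]*\bxml:lang="' + re.escape(lang) + r'"[^>]*>.*?<seg>(.*?)</seg>',
        re.DOTALL,
    )
    return [match.group(1) for match in pattern.finditer(tmx_text)]


if __name__ == "__main__":
    # One segment per line, ready to be saved as the two files to align.
    print("\n".join(extract_segments(SAMPLE_TMX, "en")))
    print("\n".join(extract_segments(SAMPLE_TMX, "es")))
```

Only the wrapping <tu>, <tuv>, and <seg> markup is dropped; the inline <bpt> and <ept> tags stay inside the text, which is the point of this preprocessing.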
This procedure can be long and tedious, but it is the way to ensure that the tags in our translation memory are preserved after alignment, which is essential for optimal leverage.
Now let's get into action. We open our aligner (as I mentioned, I will use the one from OmegaT, but you can use the one that suits you best). The most important part of this step is to adjust the segmentation rules of our CAT tool/aligner to obtain the desired segmentation:
We load the two documents into the tool and perform an auto-alignment. We then check the result and make the necessary adjustments to ensure that our memory is perfectly segmented and aligned:
And voilà! Our translation memory is now perfectly resegmented by sentence, whereas before it was segmented by paragraph:
Method 2: Forced Resegmentation with SRX Rules
For the second method, we will use Rainbow, an application from the open-source localization engineering suite Okapi Framework. This method is much faster than the previous one because it eliminates the manual preprocessing of raw translation memories. The procedure is straightforward and guided: load the TMX file with the appropriate filter, adjust the filter to our needs, convert the memory into an XLIFF file whose source and target are segmented with SRX rules set to the desired final segmentation (by sentence), and then convert the resulting XLIFF into a new TMX file with the new segmentation.
Let's begin. Open Rainbow and add the translation memory in the Input 1 tab. Select the appropriate filter (TMX, in this case):
We proceed to create the XLIFF file from our translation memory by following this menu path:
When we enter the translation package creation process, a screen lets us configure the new package. First, we define the segmentation we want. This is where we configure the resegmentation of the file by choosing SRX segmentation rules and forcing the resegmentation of both the source and the target:
Please note that, at this stage, the most important aspect is defining the segmentation rules. They need to be clear and concise so that the source and the target are segmented consistently. If you're unsure, click the "Edit..." button once your rules are loaded; it opens Ratel, an SRX editor, with your rules already loaded and a real-time preview:
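To make the logic of SRX rules more concrete: each rule pairs a "before break" pattern with an "after break" pattern and is marked either as a break or as an exception, and the first rule that matches at a given position decides. The Python sketch below imitates that behaviour with two hypothetical rules; it is an illustration of the idea, not Okapi's implementation, and SRX files use ICU regular expressions, which differ from Python's in some details.

```python
import re

# (is_break, before_break, after_break), evaluated in order; first match wins.
# Both rules are illustrative assumptions, not taken from any shipped SRX file.
RULES = [
    (False, r"\b(?:Mr|Mrs|Dr|etc)\.", r"\s"),  # exception: common abbreviations
    (True, r"[.?!]+", r"\s+"),                 # break after sentence-final punctuation
]


def segment(text, rules=RULES):
    """Split text at every position where the first matching rule is a break rule."""
    compiled = [
        (is_break, re.compile("(?:" + before + r")\Z"), re.compile(after))
        for is_break, before, after in rules
    ]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            # "before" must match the text ending at pos, "after" the text starting at pos.
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # first matching rule decides, break or not
    segments.append(text[start:])
    return segments


if __name__ == "__main__":
    for part in segment("Dr. Smith arrived. He sat down."):
        print(repr(part))
```

The whitespace after a break stays at the start of the following segment in this sketch; real tools usually trim or preserve it according to their own settings. Because the exception rule comes first, "Dr." does not trigger a break, which is why rule order matters when the same rules are applied to both source and target.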
Now, select the option and configure the output file (XLIFF in our case):
Upon executing the action, a package is created, and within it, we find a folder called "Work" where our XLIFF file is located. The XLIFF file contains the source text stored in our translation memory, now segmented by sentence, as well as the segmented target text:
Open the resulting XLIFF file in a text editor and verify that the text has been segmented according to our preferences:
And indeed, it has been done. Note how segmented text is represented in an XLIFF 1.2 file: for the source, the segments are enclosed in a <seg-source> element and delimited by <mrk> tags, while for the target, the <mrk> tags sit directly inside the <target> element.
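The structure described above can also be checked programmatically. The following Python sketch pairs source and target segments by their mid attribute; the sample trans-unit is hand-written for illustration (it is not actual Rainbow output), and the function name segment_pairs is hypothetical.

```python
import xml.etree.ElementTree as ET

# XLIFF 1.2 namespace.
NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Hand-written sample for illustration only.
SAMPLE_XLIFF = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="demo.tmx" source-language="en" target-language="es" datatype="x-tmx">
    <body>
      <trans-unit id="1">
        <source>First sentence. Second sentence.</source>
        <seg-source><mrk mid="0" mtype="seg">First sentence.</mrk> <mrk mid="1" mtype="seg">Second sentence.</mrk></seg-source>
        <target xml:lang="es"><mrk mid="0" mtype="seg">Primera frase.</mrk> <mrk mid="1" mtype="seg">Segunda frase.</mrk></target>
      </trans-unit>
    </body>
  </file>
</xliff>"""


def segment_pairs(xliff_text):
    """Return (source, target) text pairs for every segment marker in the file."""
    root = ET.fromstring(xliff_text)
    pairs = []
    for unit in root.iterfind(".//x:trans-unit", NS):
        sources = {
            m.get("mid"): "".join(m.itertext())
            for m in unit.iterfind("x:seg-source/x:mrk[@mtype='seg']", NS)
        }
        targets = {
            m.get("mid"): "".join(m.itertext())
            for m in unit.iterfind("x:target/x:mrk[@mtype='seg']", NS)
        }
        for mid, source_text in sources.items():
            # A missing target segment shows up as None and deserves a manual look.
            pairs.append((source_text, targets.get(mid)))
    return pairs


if __name__ == "__main__":
    for source_text, target_text in segment_pairs(SAMPLE_XLIFF):
        print(source_text, "|", target_text)
```

A pair whose target is None means the source and target were not split into the same number of segments, which is a sign that the segmentation rules need adjusting.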
We can also open the XLIFF file in our CAT tool to verify the segmentation:
To conclude, we load the resulting XLIFF file back into Rainbow, in the Input 1 tab, and the XLIFF filter will be automatically selected:
Open the menu Utilities > Conversion Utilities > File Format Conversion...
Finally, in the configuration window that appears, select the following options and execute them:
A TMX file will be created alongside the converted XLIFF file, but it may contain some unnecessary information in <prop> elements:
That can be cleaned up using the following regex in a Search and Replace operation:
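As an illustration of this kind of cleanup, here is a Python sketch that removes every <prop> element. The pattern is an assumption, not the regex from the screenshot, and it deletes all properties indiscriminately, so narrow it down if some of them matter to you. The property type in the sample is made up.

```python
import re

# Matches a whole <prop>...</prop> element plus surrounding indentation and line break.
PROP_PATTERN = re.compile(r"[ \t]*<prop\b[^>]*>.*?</prop>[ \t]*\r?\n?", re.DOTALL)

# Hypothetical sample; the property type is invented for illustration.
SAMPLE_TU = """<tu>
  <prop type="x-demo">demo.tmx</prop>
  <tuv xml:lang="en"><seg>First sentence.</seg></tuv>
</tu>"""


def strip_props(tmx_text):
    """Return the TMX text with all <prop> elements removed."""
    return PROP_PATTERN.sub("", tmx_text)


if __name__ == "__main__":
    print(strip_props(SAMPLE_TU))
```

The same pattern can be tried in the regex mode of a text editor's Search and Replace dialog, with an empty replacement, provided the editor lets "." match line breaks.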
Finally, manually change the "segtype" attribute in the header from "paragraph" to "sentence".
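When several memories have to be processed, the header change can be scripted as well. This Python sketch edits only the segtype value and leaves the rest of the file untouched; the function name set_segtype and the sample header are illustrative assumptions.

```python
import re

# Captures everything up to the segtype value inside the <header> tag.
HEADER_SEGTYPE = re.compile(r'(<header\b[^>]*\bsegtype=")[^"]*(")')

# Hypothetical header, used only for illustration.
SAMPLE_HEADER = (
    '<header creationtool="demo" creationtoolversion="1.0" segtype="paragraph" '
    'o-tmf="demo" adminlang="en" srclang="en" datatype="plaintext"/>'
)


def set_segtype(tmx_text, new_value="sentence"):
    """Return the TMX text with the header's segtype replaced by new_value."""
    return HEADER_SEGTYPE.sub(
        lambda m: m.group(1) + new_value + m.group(2), tmx_text, count=1
    )


if __name__ == "__main__":
    print(set_segtype(SAMPLE_HEADER))
```

A regex is used here instead of an XML parser so that the declaration, indentation, and everything else in the file stay exactly as they were.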
Now we will have our translation memory resegmented by sentence, while preserving our tags, fully optimized for the intended use:
Looking for a full-time job as a Test Automation Engineer.
1 month ago: I have created segments using the sentence type. Now, how do I convert them to paragraph-based segmentation?
Independent Writer and Language Services Teacher
4 months ago: Food for thought here, thank you. Now I'll try to translate all that into a process with my usual toolset. It's been too long since I messed with the Rainbow suite.
Perfecting brand communication across 15 languages, one word at a time | Language technology specialist | Entrepreneur
9 months ago: Hi Víctor, thank you for your contribution, it was very helpful. I'd like to suggest a slight improvement to the 2nd method, though. There is no need to convert into XLIFF and back to TMX. You can set up a simple custom pipeline like you see in the pic to process the TMX file in Rainbow. Afterwards you'll still need to delete the <prop> tags, but there are a lot fewer.
Head of Localization @ Unite | Freelance Translator German to English | Freelance Proofreader & Copyeditor for English
12 months ago: We love your posts over at Unite's loc team! Alejandro, right?
Operations | BioTech
12 months ago: Excellent post Victor! Really helpful for a lot of people I am sure!