Harnessing the Power of FME and AI to Translate Our Training Material: A Tutorial

Context

For a long time, our training materials have been primarily in English. Given the strong English proficiency of Finland's data-oriented workforce, this has not posed significant problems. It's worth noting that Finland has two official languages, Finnish and Swedish, along with several Sami languages spoken by indigenous communities in the northernmost regions of Europe.

However, this year we decided to allocate resources to creating Finnish-language versions of our general training materials. What initially looked like a straightforward two-phase process ("Translate in Word" followed by "Petri, could you please check?") turned into a more intricate and unexpectedly enjoyable endeavor, partly because of the limitations we ran into with Word.

That experience prompted this tutorial, in which we build a pipeline that updates a Word document automatically using open-source machine learning tools.


Approach

When traditional methods failed, we turned to FME and open-source translation models for a solution.

To make things interesting, we aimed to preserve the document's layout while replacing its text.

Here are the main steps:

  1. Extract text elements and their positions from the Word document's XML tree (yes, a Word DOCX is XML, and we all love XML, don't we?).
  2. Export text to CSV for translation.
  3. Translate text using AI.
  4. Update the original DOCX file with translated content.


DOCX Is XML and FME Loves XML: Text Extraction

Unzip the DOCX file and extract the "t" elements from the XML tree. This required only a few transformers.

  • TempPathnameCreator: Creates a temporary folder.
  • ZipExtractorAnyExtension: Unzips the content. This transformer was developed by Takashi Iijima, a pivotal figure in the FME community known for his insightful blog on FME and Python integration. (Special mention also to Don Murray for his contributions.)
  • Tester: Selects the "document.xml" file among the many files extracted.
  • FeatureReader: Extracts the "t" text elements and their positions in the tree, thanks to its remarkable tree discovery tool.
  • VariableSetter: Stores the path of the temporary folder for the later update step.
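
For readers who want to check the same idea outside FME, the extraction can be sketched in plain Python. This is a minimal sketch, not the workspace described above: the function name extract_text_elements and the dotted position IDs are assumptions made for illustration, and they will not match the IDs that FME generates.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the <w:t> text elements in document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_text_elements(docx_path):
    """Return (position_id, text) pairs for every non-empty <w:t> element.

    position_id is a hypothetical dotted path of 1-based child positions
    counted from the root element; it is not FME's own ID format.
    """
    # A DOCX is a ZIP archive; the main text lives in word/document.xml
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    results = []

    def walk(elem, path):
        if elem.tag == f"{{{W_NS}}}t" and elem.text:
            results.append((".".join(map(str, path)), elem.text))
        for position, child in enumerate(elem, start=1):
            walk(child, path + [position])

    walk(root, [])
    return results
```

Recording the position of each element alongside its text is what makes it possible to write the translation back to the right place later.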

Translation: Essential Yet Tedious Data Preparation

Now that we've obtained the text elements for translation, we export them to a temporary CSV file. This keeps the workflow modular and makes it easy to hand the text to external tools for testing and batch processing.


  • We create a separate temporary folder to avoid confusion.
  • To set the location for the FeatureWriter, we use an FME trick: the FeatureMerger with a 1:1 parameter.
  • AttributeRenamer lets us customize attribute names.
  • Finally, we write the CSV with the FeatureWriter, making sure to omit the byte order mark (BOM) so that downstream tools read the file cleanly.
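
To illustrate the BOM point outside FME, here is a minimal Python sketch of the same export. The function name and the column names xml_id and text are assumptions for illustration, not the names used in our workspace.

```python
import csv


def write_elements_csv(csv_path, rows):
    """Write (xml_id, text) pairs to a UTF-8 CSV without a byte order mark."""
    # "utf-8" writes no BOM; "utf-8-sig" would prepend one (bytes EF BB BF)
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["xml_id", "text"])  # hypothetical column names
        writer.writerows(rows)
```

If a BOM is present and the reader does not expect it, the first column name is read with an invisible extra character, which breaks lookups by column name.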

The Trendy/Bubble Part: AI Translation Using a Locally Based ML Model

While there are numerous APIs available for translating text elements at this stage, we opted to delve deeper into the latest ML models and toolsets. For this purpose, we decided to install a model developed by the University of Helsinki, accessible through the Hugging Face portal: https://huggingface.co/Helsinki-NLP/opus-mt-en-fi.

Initial Impressions

Machine learning has become far more accessible. Platforms like Hugging Face provide not only documentation but also ways to test models, download datasets, and use test instances, a considerable step forward from working with tools like OpenCV or TensorFlow just a few years ago.

Implementation

Installation

Following the instructions provided by Hugging Face, setting up the initial script is remarkably straightforward.

  • First, create a dedicated virtual environment using Python's built-in venv module:

python -m venv /pathtoenvironment

  • Refer to their documentation to obtain a test script and familiarize yourself with the tool's functionality.

Note: Be mindful of the Python version required by the dependencies. Additionally, if you plan to utilize GPU acceleration, ensure compatibility with your Python version and CUDA drivers. However, such compatibility issues are far less common than they were a few years ago. We used Python 3.11.8 for our setup.

Adaptation

Modify the test Python script to read a CSV file containing text elements, translate them, and generate a new CSV file that includes the translated elements along with their corresponding IDs. This ensures you can easily update the original content later.
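
As a rough idea of what such a script can look like, here is a minimal sketch. It is not our production script: the function names, the column names (xml_id, text, translated_text), and the batch size are assumptions for illustration. The model is loaded through the Hugging Face transformers translation pipeline, which must be installed in the virtual environment.

```python
import csv


def load_translator(model_name="Helsinki-NLP/opus-mt-en-fi"):
    """Build a translate function backed by a local Hugging Face pipeline."""
    # Imported lazily so the CSV logic can be tested without the model installed
    from transformers import pipeline

    translator = pipeline("translation", model=model_name)

    def translate(texts):
        # The pipeline returns one dict per input text
        return [item["translation_text"] for item in translator(texts)]

    return translate


def translate_csv(in_path, out_path, translate, batch_size=16):
    """Translate the 'text' column and write 'xml_id' plus 'translated_text'."""
    with open(in_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))

    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["xml_id", "translated_text"])
        # Translate in small batches to keep memory use predictable
        for start in range(0, len(rows), batch_size):
            batch = rows[start:start + batch_size]
            translated = translate([row["text"] for row in batch])
            for row, text in zip(batch, translated):
                writer.writerow([row["xml_id"], text])
```

A call such as translate_csv("elements.csv", "translated.csv", load_translator()) would run the whole step (both file names are placeholders). Passing the translate function as a parameter also lets you try the CSV handling with a dummy function before downloading the model.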

Please contact us if you want the full working example code.

Keeping the translation step as a standalone script makes it easy to test independently. We can now integrate it with FME.

FME to Call the AI Tool: The Powerful SystemCaller

FME is as versatile as a Swiss army knife, and it is particularly good at invoking external tools. In our case, we used SystemCaller to run our Python tool inside a virtual environment. If anyone knows how to accomplish this directly through PythonCaller, we welcome their input.

  • Prepare a batch file listing the steps needed to run your tool. Tools like ChatGPT can speed this up. Below is a shortened example of the translate.bat file we generated:

rem Activate the virtual environment
cd /d D:\
cd ai2
call mynewenv\Scripts\activate.bat
echo Virtual environment activated.

rem Call the Python script with parameters
python test03042024.py %1 %2 %3 %4 %5

rem Deactivate the virtual environment (call returns control to this script)
call deactivate

  • Invoke the batch file or PowerShell script through SystemCaller, leveraging FME's variables for enhanced flexibility.
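
One possible alternative to activating the environment in a batch file is to call the environment's own interpreter directly: a Python process started from a virtual environment's executable uses that environment's packages without any activation step. The sketch below shows the idea with the standard subprocess module. The function names are hypothetical, and we have not verified this approach inside PythonCaller.

```python
import os
import subprocess


def venv_python(venv_dir):
    """Return the interpreter path inside a virtual environment."""
    # Windows venvs keep executables in Scripts, POSIX ones in bin
    if os.name == "nt":
        return os.path.join(venv_dir, "Scripts", "python.exe")
    return os.path.join(venv_dir, "bin", "python")


def run_script(python_exe, script_path, *args):
    """Run a script with the given interpreter and return its stdout."""
    completed = subprocess.run(
        [python_exe, script_path, *args],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout
```

With the folder layout of the batch file above, the call would look like run_script(venv_python(r"D:\ai2\mynewenv"), r"D:\ai2\test03042024.py", input_csv, output_csv), where input_csv and output_csv stand for the paths FME passes in.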

Updating the Word Document

With our translated text elements now stored in a CSV, the next step involves updating the XML and zipping the result.

Updating the XML

Let's proceed with updating the document.xml file:

  • Read the CSV containing the translated elements. Using FME for this step makes testing and data quality checks easier and accommodates special use cases.

  • Retrieve the path for the document.xml file with VariableRetriever.
  • Use PythonCaller to locate the elements in the XML tree and perform the update. Below is a snippet of the main functions, which we then integrated with FME objects. The libraries they use are already installed in the FME Python folder.

from lxml import etree


def update_xml_with_translations(xml_path, translation_mappings):
    """Replace the text of each mapped element in document.xml, in place."""
    parser = etree.XMLParser(remove_blank_text=True)
    tree = etree.parse(xml_path, parser)
    root = tree.getroot()

    for xml_id, translated_text in translation_mappings.items():
        # Find the element by its XML ID
        elem = find_element_by_id(root, xml_id)
        if elem is not None:
            elem.text = translated_text  # Update the text content

    tree.write(xml_path, pretty_print=True, encoding='utf-8', xml_declaration=True)


def find_element_by_id(root, xml_id):
    """Walk down from <w:body> using the child positions stored in xml_id."""
    # Drop the first two components of the dotted ID; the rest are child positions
    indices = xml_id.split('.')[2:]
    elem = root.find('.//w:body', namespaces=root.nsmap)
    for index in indices:
        try:
            # Positions in the ID are 1-based, lxml children are 0-based
            elem = elem[int(index) - 1]
        except (IndexError, TypeError, ValueError):
            return None  # Element not found
    return elem
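
The translation_mappings argument above is a dictionary that maps each XML ID to its translated text. As a minimal sketch, assuming the translated CSV has columns named xml_id and translated_text (hypothetical names), such a dictionary could be built like this:

```python
import csv


def load_translation_mappings(csv_path):
    """Return {xml_id: translated_text} from the translated CSV."""
    mappings = {}
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Skip rows with an empty translation so the original text is kept
            if row["translated_text"]:
                mappings[row["xml_id"]] = row["translated_text"]
    return mappings
```

Skipping empty translations is a design choice: an element whose translation failed keeps its English text instead of being blanked out.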

Zipping the DOCX

We're nearing the final stage! To zip our folder, we'll utilize basic DOS commands through SystemCaller, with FME aiding in naming and parameterization.

The main commands, chained with "&&", are as follows:

  • xcopy: This copies your temporary data to a folder of your choice.
  • powershell Compress-Archive: This compresses your folder. Use the "*" wildcard in the source path so that the folder's contents, not the folder itself, are archived; an extra top-level subfolder would break your DOCX.
  • rename: This renames your ZIP file with the ".docx" extension.
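
The same step can also be sketched with Python's standard zipfile module, which makes the subfolder pitfall explicit. This is an alternative illustration, not the commands we ran, and the function name is hypothetical.

```python
import os
import zipfile


def zip_folder_as_docx(folder, docx_path):
    """Zip the contents of folder (not the folder itself) into docx_path."""
    with zipfile.ZipFile(docx_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for current_dir, _subdirs, files in os.walk(folder):
            for name in files:
                full_path = os.path.join(current_dir, name)
                # Archive names are relative to folder, so [Content_Types].xml
                # and word/document.xml sit at the archive root, not in a subfolder
                relative = os.path.relpath(full_path, folder)
                archive.write(full_path, relative.replace(os.sep, "/"))
```

Write the output file outside the folder being zipped; otherwise the archive would try to include itself.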

Check the Result

Congratulations! With the zipping process completed, you can now open your Word file and marvel (and/or laugh) at the translated content.


Main Points on the Output Quality:

  • Style and Layout Integrity: We maintain the original document's style and layout to ensure consistency in appearance.
  • Untranslated Titles and Tables: Titles and table content remain untranslated because our extraction step did not capture their text elements (in a DOCX, table text sits deeper in the tree, inside table cells), so this needs further work.
  • Translation Oddities: Some translations read as childlike, resembling the language of young Finnish speakers. This happens because Word splits paragraphs into short "runs", so the model translates fragments without their surrounding sentence. FME can help address this.
  • Performance Concerns: Inference with this model can be slow, particularly without a high-performance GPU.
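
To make the "runs" issue concrete, the sketch below joins the text of all runs in a paragraph, which is the unit a translation model would ideally receive. It is an illustration in plain Python rather than the FME approach, the function name is hypothetical, and it does not cover the harder part of distributing the translated text back across the original runs.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used in document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraph_texts(document_xml):
    """Return one string per <w:p>, joining the text of all its <w:t> runs."""
    root = ET.fromstring(document_xml)
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        pieces = [t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t")]
        if any(pieces):  # skip paragraphs without any text
            paragraphs.append("".join(pieces))
    return paragraphs
```

A sentence that Word stores as the runs "Open the " and "workspace." is returned as the single string "Open the workspace.", giving the model the full context.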

Conclusion

In this tutorial and reflection on our experience, you've witnessed the fusion of open-source machine learning and FME for contemporary translation tasks. More broadly, it offers insights into crafting data transformation pipelines that integrate FME with cutting-edge tools locally, eliminating the dependency on APIs.

The emergence of AI project-sharing platforms such as Hugging Face, coupled with the availability of open models from research institutions like the University of Helsinki, has democratized experimentation with modern tools. The innovation loop has never been shorter. FME makes it easy to integrate these advances into your processes and products, because it lets you build robust proofs of concept quickly. These are understandable to non-coders, thanks to clear, reusable blocks of work that can be expanded into more generic use cases.



Oliver Morris

Business Director, UK

11 months ago

Such a great use case, and the step-by-step breakdown is awesome. Great work, and thanks for sharing. One option that came out a few weeks ago is a Finnish open-source large language model, https://ollama.com/osoderholm/poro. It runs on Ollama, which you can use through the LocalGenerativeAICaller transformer (https://hub.safe.com/publishers/tensing/transformers/localgenerativeaicaller). This way you won't need the SystemCaller and intermediate files. You will need at least 32 GB of RAM, but I think this could produce a really good translation from English to Finnish. Ping me if you need a hand setting up.
