Harnessing the Power of FME and AI to Translate Our Training Material: A Tutorial
Context
Over the years, our training materials have primarily been in English. Given the strong English proficiency among Finland's data-oriented workforce, this hasn't posed significant issues. It's worth noting that Finland has two official languages, Finnish and Swedish, as well as several Sami languages spoken by the indigenous communities of the northernmost regions of Europe.
This year, however, we decided to allocate resources to creating Finnish-language versions of our general training materials. What initially seemed like a straightforward two-phase process ("Translate in Word" followed by "Petri, could you please check?") turned into a more intricate and unexpectedly enjoyable endeavor, partly due to the limitations we ran into with Word.
This realization serves as the catalyst for the tutorial ahead, where we delve into the process of constructing a pipeline to dynamically update a Word document by leveraging open-source machine learning tools.
Approach
When traditional methods failed, we turned to FME and open-source translation models for a solution.
To make things interesting, we aimed to preserve the document's layout while replacing its text.
Here are the main steps:
DOCX is XML and FME loves XML: Text extraction
A DOCX file is simply a ZIP archive of XML parts. Unzip it and extract the text ("w:t") elements from word/document.xml; in FME this required only a few transformers.
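For reference, the equivalent step outside FME can be sketched in a few lines of Python. This is only an illustration of the idea, not part of our workspace; the file and function names here are made up, and the ID is a simplified running index (our FME workflow uses a dotted element-path ID instead, as you will see in the update code later).

    # extract_texts.py - rough Python equivalent of the FME extraction step.
    import zipfile
    from lxml import etree

    W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

    def extract_text_elements(docx_path):
        """Return (index, text) pairs for every w:t element in word/document.xml."""
        with zipfile.ZipFile(docx_path) as zf:
            xml_bytes = zf.read("word/document.xml")
        root = etree.fromstring(xml_bytes)
        texts = root.findall(f".//{{{W_NS}}}t")
        return [(i, t.text or "") for i, t in enumerate(texts)]

    if __name__ == "__main__":
        for i, text in extract_text_elements("training_material_en.docx"):
            print(i, text)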
Translation: Essential Yet Tedious Data Preparation
Now that we have the text elements to translate, we export them, together with an ID for each element, into a temporary CSV file. This keeps the workflow modular and robust, and makes it easy to test and batch-process the data with external tools.
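In FME this export is simply a CSV writer. Continuing the hypothetical Python sketch from above (the column names are illustrative), the same export would amount to something like:

    # Export (id, source_text) pairs to a temporary CSV for the translation step.
    import csv

    def export_to_csv(pairs, csv_path):
        with open(csv_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["id", "text_en"])
            writer.writerows(pairs)

    export_to_csv(extract_text_elements("training_material_en.docx"), "to_translate.csv")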
The Trendy/Bubble Part: AI Translation Using a Locally Run ML Model
While there are numerous APIs available for translating text elements at this stage, we opted to delve deeper into the latest ML models and toolsets. For this purpose, we decided to install a model developed by the University of Helsinki, accessible through the Hugging Face portal: https://huggingface.co/Helsinki-NLP/opus-mt-en-fi.
Initial Impressions
Machine learning has become far more accessible. Platforms like Hugging Face not only provide documentation but also let you test models, download datasets, and try test instances, a considerable advance over working with tools like OpenCV or TensorFlow just a few years ago.
Implementation
Installation
Following the instructions provided by Hugging Face, setting up the initial script is remarkably straightforward.
python -m venv /pathtoenvironment
Note: Be mindful of the Python version required by the dependencies. Additionally, if you plan to utilize GPU acceleration, ensure compatibility with your Python version and CUDA drivers. However, such compatibility issues are far less common than they were a few years ago. We used Python 3.11.8 for our setup.
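For reference, a minimal smoke test of the model might look like this inside the activated environment. The exact package list is an assumption (torch, transformers, and sentencepiece are a reasonable starting point), and the test sentence is of course arbitrary:

    pip install torch transformers sentencepiece

    # quick_test.py - load the Helsinki-NLP/opus-mt-en-fi model and translate one sentence.
    from transformers import pipeline

    translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fi")
    print(translator("FME loves XML.")[0]["translation_text"])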
Adaptation
Modify the test Python script to read a CSV file containing text elements, translate them, and generate a new CSV file that includes the translated elements along with their corresponding IDs. This ensures you can easily update the original content later.
Please contact us if you want the full working example code.
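As a rough sketch of the idea (not our exact script; the column names, batch size, and file handling here are illustrative):

    # translate_csv.py - read (id, text) rows, translate, write (id, translated text) rows.
    import csv
    import sys
    from transformers import pipeline

    def translate_csv(in_path, out_path):
        translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fi")
        with open(in_path, newline="", encoding="utf-8") as f:
            rows = list(csv.DictReader(f))
        texts = [row["text_en"] for row in rows]
        results = translator(texts, batch_size=16)
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["id", "text_fi"])
            for row, res in zip(rows, results):
                writer.writerow([row["id"], res["translation_text"]])

    if __name__ == "__main__":
        translate_csv(sys.argv[1], sys.argv[2])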
Structuring the script this way makes it easy to test the translation step on its own. Next, we integrate it with FME.
FME to Call the AI Tool: The Powerful SystemCaller
FME, versatile as a Swiss army knife, particularly shines in how effortlessly it can invoke external tools. In our case, we used SystemCaller to execute our Python tool inside its virtual environment. If anyone knows how to accomplish this directly with PythonCaller, we would welcome the input.
rem Activate the virtual environment
cd /d D:\
cd ai2
call mynewenv\Scripts\activate.bat
echo Virtual environment activated.
rem Call the Python script with parameters
python test03042024.py %1 %2 %3 %4 %5
rem Deactivate virtual environment
deactivate
Updating the Word Document
With our translated text elements now stored in a CSV, the next step involves updating the XML and zipping the result.
Updating the XML
Let's proceed with updating the document.xml file:
    from lxml import etree

    def update_xml_with_translations(xml_path, translation_mappings):
        parser = etree.XMLParser(remove_blank_text=True)
        tree = etree.parse(xml_path, parser)
        root = tree.getroot()
        for xml_id, translated_text in translation_mappings.items():
            # Find the element by its XML ID
            print(xml_id)
            elem = find_element_by_id(root, xml_id)
            if elem is not None:
                elem.text = translated_text  # Update the text content
                print(elem.text)
        tree.write(xml_path, pretty_print=True, encoding='utf-8', xml_declaration=True)

    def find_element_by_id(root, xml_id):
        # The ID is a dotted path of 1-based child indices; the first two
        # components are skipped, the rest are walked down from w:body.
        indices = xml_id.split('.')[2:]
        # Traverse the XML tree to find the element
        elem = root.find('.//w:body', namespaces=root.nsmap)
        for index in indices:
            print(elem)
            index = int(index) - 1
            try:
                elem = elem[index]
            except (IndexError, TypeError):
                print("error")
                return None  # Element not found
        return elem
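To tie this back to the translated CSV, the translation_mappings dictionary can be built from it. A sketch, assuming the same illustrative column names as above and IDs stored in whatever format find_element_by_id expects:

    # Build the id -> translated text mapping from the translated CSV (names illustrative).
    import csv

    def load_translations(csv_path):
        with open(csv_path, newline="", encoding="utf-8") as f:
            return {row["id"]: row["text_fi"] for row in csv.DictReader(f)}

    update_xml_with_translations("unzipped_docx/word/document.xml", load_translations("translated.csv"))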
Zipping the DOCX
We're nearing the final stage! To zip the folder back into a DOCX, we use basic DOS commands through SystemCaller, with FME handling the naming and parameterization. The individual steps are chained together with "&&" into a single command line.
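If you prefer to avoid the shell entirely, the same re-packing step can also be sketched in a few lines of Python, since a DOCX is simply a ZIP of the folder contents. The folder and file names below are hypothetical:

    # repack_docx.py - zip the updated folder back into a .docx (names hypothetical).
    import zipfile
    from pathlib import Path

    def repack_docx(folder, docx_out):
        folder = Path(folder)
        with zipfile.ZipFile(docx_out, "w", zipfile.ZIP_DEFLATED) as zf:
            for path in sorted(folder.rglob("*")):
                if path.is_file():
                    # Store paths relative to the folder root, with forward slashes.
                    zf.write(path, path.relative_to(folder).as_posix())

    repack_docx("translated_docx", "training_material_fi.docx")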
Check the result!!
Congratulations! With the zipping process completed, you can now open your Word file and marvel (and/or laugh) at the translated content.
Main Points on the Output Quality:
Conclusion
In this tutorial and reflection on our experience, we combined open-source machine learning with FME for a contemporary translation task. More broadly, it shows how to build data transformation pipelines that integrate FME with cutting-edge tools running locally, removing the dependency on external APIs.
The emergence of AI project-sharing platforms such as Hugging Face, coupled with the availability of open models from research institutions like the University of Helsinki, has democratized experimentation with modern tools. The innovation loop has never been shorter. FME makes it straightforward to integrate these advances into your own processes and products and to build robust proofs of concept quickly. These remain comprehensible to non-coders, thanks to clear, reusable blocks of work that can be expanded into more generic use cases.
Business Director, UK
11 months ago: Such a great use case and the step-by-step breakdown is awesome, great work and thanks for sharing. One option that came out a few weeks ago is a Finnish open-source large language model - https://ollama.com/osoderholm/poro - this runs on Ollama, which you can use with the LocalGenerativeAICaller transformer (https://hub.safe.com/publishers/tensing/transformers/localgenerativeaicaller). This way you won't need to use the SystemCaller and intermediary files. You will need to run on at least 32GB of RAM, but I think this could produce a really good translation from English to Finnish. Ping me if you need a hand setting up.