Open-Source Tool Helps Users Anonymize Data in 24 EU Languages

Suky Gu

Translation | Localization | DTP | Transcription | Voice over | Subtitling

Published: July 12, 2021

The first results of a data anonymization project were released in May 2021 as a pre-beta online demo. Called “MAPA” (Multilingual Anonymization toolkit for Public Administrators), the project aims to help EU public administrators share data while staying compliant with data regulations.

MAPA is led by language service provider (LSP) Pangeanic, which was awarded EUR 1m in funding by the European Commission’s Innovation and Networks Executive Agency (INEA) in January 2020. Pangeanic is working alongside a number of partners including the French National Center for Scientific Research (CNRS), LSP Tilde, language resource center ELRA, and the University of Malta.

Using AI-based Named Entity Recognition (NER), the tool identifies personal details in line with the EU’s General Data Protection Regulation (GDPR). Data such as names, credit card numbers, dates, and professions are anonymized. Entering the English sentence “Rosalind Franklin was born on 25 July 1920,” for instance, will return “******* ******** was born on ** **** ****.”
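To make the masking step concrete, here is a minimal Python sketch of how detected entity spans could be replaced with asterisks. It is not MAPA's code: the `EntitySpan` type and the `mask_entities` and `find_span` helpers are hypothetical names invented for illustration, and the spans are located by plain string search as a stand-in for what a trained NER model would return. The sketch masks one asterisk per non-space character, so its output may differ slightly from what the online demo prints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySpan:
    """A detected entity: character offsets [start, end) plus a label."""
    start: int
    end: int
    label: str


def find_span(text: str, fragment: str, label: str) -> EntitySpan:
    """Stand-in for an NER model: locate a known fragment by string search."""
    start = text.index(fragment)
    return EntitySpan(start, start + len(fragment), label)


def mask_entities(text: str, spans: list[EntitySpan], mask_char: str = "*") -> str:
    """Replace every non-space character inside each span with mask_char."""
    chars = list(text)
    for span in spans:
        for i in range(span.start, span.end):
            if not chars[i].isspace():
                chars[i] = mask_char
    return "".join(chars)


sentence = "Rosalind Franklin was born on 25 July 1920"
spans = [
    find_span(sentence, "Rosalind Franklin", "PERSON"),  # assumed label name
    find_span(sentence, "25 July 1920", "DATE"),         # assumed label name
]
masked = mask_entities(sentence, spans)
print(masked)
```

Keeping spaces intact preserves word boundaries, which keeps the masked text readable; a production system might instead use fixed-length placeholders so that the length of a name is not leaked.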

The tool is available in all 24 official EU languages and focuses on the legal and medical domains. The beta version will be released in June 2021, while the final toolkit will be available later in the year for several use cases.

The tool will be downloadable as a fully deployable Docker container under an open-source license. (A Docker container packages software together with its dependencies so that it runs consistently on different machines, which lets organizations deploy the tool locally and connect it with other software.) Once released, users will be able to incorporate the tool into their own processes by building on existing code.

Dual Needs: Transparency and Compliance

The MAPA project arose as a means to address the dilemma: How can public administrators share data across public bodies and borders within the EU while protecting EU citizens’ data?

Manuel Herranz, CEO of Pangeanic, told Slator, “EU administrators suffer from a double mandate. They must be seen to offer transparency in the way data is shared across the EU while also complying with GDPR.”

A tool such as MAPA, which reliably removes personal details in all EU languages, will pave the way for EU administrations to benefit from big data by, for example, sharing large datasets for machine learning purposes. The first major use case, according to Herranz, will be European Complaints Watch, which will be provided with a locally-run data anonymization service per EU country.

Languages Learn From Other Languages

While the pre-beta version was developed within a year of project launch, the MAPA initiative has not been without its challenges. Covid-19 disrupted a plan to focus equally on legal and medical texts. “EU health authorities are already stressed so we’ve ended up focusing more on the legal domain,” Herranz said.

On the plus side, the AI-based tool revealed an intriguing ability: languages can learn from other languages. Herranz explained, “That’s the beauty of neural networks. We found that by mixing everything together in one large multilingual stew, the tool could recognize entities in languages for which it had not been trained.”

The finding gave the MAPA team an advantage. Low-resource languages could be trained to a reasonable level of accuracy using general multilingual data, then topped up with targeted data to enhance quality. “Maltese already ran very well when we had no Maltese network; and then by adding Maltese data, we were able to really fine-tune the results,” Herranz added.

So far, reaction to the pre-beta release has been positive. Herranz said, “We’ve received very good comments saying it’s working very well in Latvian, Spanish, and French.”

However, the project is still a work in progress. The MAPA team is aiming for accuracy above 95%. While some languages are performing at 98%, others are sitting at around the 89% mark. According to Herranz, “Results are pretty promising in most languages but there is still more work to do.”
