AI Meets Culture: TunDerja – Tunisian Derja to English Translator

One of the things I often encounter as a Tunisian is the assumption that speaking a few Arabic words like "Marhaba" or "Kifek" equates to speaking Tunisian. While it's true that Arabic is the official language across most of the MENA region, each country has its own dialect, and ours is Tunisian Derja. Like Moroccan Derja, it has its own phrases, idioms, and nuances that set it apart from other dialects. In fact, Tunisians and Moroccans sometimes struggle to fully understand each other because of these differences, even though we share many cultural and linguistic roots.

This diversity in dialects is what makes our region so special. It’s a reflection of the many civilizations and cultures that have influenced Tunisia over the centuries, from the Phoenicians to the Romans to the Ottomans. Tunisian Derja is not just a dialect; it’s a cultural heritage that carries within it the stories and histories of our people. This realization further fueled my desire to create an AI translator that could preserve and promote the beauty of Tunisian Derja, showcasing the depth and diversity of our language.


The Spark: Why I Trained an AI Model for Tunisian Derja

As an AI enthusiast and professional, I’ve always been fascinated by how technology can solve real-world problems, especially when it comes to bridging communication gaps. The lack of an effective and accessible tool for translating Tunisian Derja into English became increasingly apparent as I interacted with people from different cultures. People would often approach me with Arabic words like “Marhaba” or “Kifek,” thinking they were speaking Tunisian. But these words don’t represent the unique richness of our dialect.

I noticed that many existing AI translation models catered to Modern Standard Arabic (MSA) or other dialects, but none that I found were truly tailored to Tunisian Derja. This gap in the AI space inspired me to take action. My mission became clear: to develop a translation model that could handle the nuances of Tunisian Derja and accurately translate it into English. This would not only make our dialect more accessible but also contribute to preserving our linguistic heritage in the digital age.


The Journey Begins: Dataset Gathering and Exploration

Building an AI translator for a dialect like Tunisian Derja requires data—lots of it. The first step in my journey was to gather a dataset that accurately represented the diversity and nuances of everyday Tunisian speech. But, as anyone who has worked with AI models knows, finding well-structured datasets for niche dialects is no easy task.

I began by scouring the internet, social media platforms, and various NLP repositories, looking for any data related to Tunisian Arabish, Tunisian Derja, and Tunisian-to-English translations. I came across several datasets, including the Tunisian Arabish Corpus (TArC) and others that mixed Tunisian Derja written in Latin characters (common on social media) with Derja written in Arabic script.

However, the data I found wasn’t enough. It was often scattered across platforms, inconsistent, and lacked the conversational depth I needed. So, I decided to create my own custom dataset, compiling phrases, idiomatic expressions, and common translations between Tunisian Derja and English. This process involved manually curating, cleaning, and structuring the data so that it would be ready for training.
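To make the curation step more concrete, here is a minimal Python sketch of what cleaning and de-duplicating phrase pairs could look like. It is an illustration under my own assumptions, not the actual pipeline behind TunDerja: the function names are hypothetical, and the sample phrase is an illustrative Latin-script transliteration rather than a row from the real dataset.

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()

def clean_pairs(pairs):
    """Normalize (derja, english) pairs, drop incomplete rows, remove duplicates."""
    seen = set()
    cleaned = []
    for derja, english in pairs:
        derja, english = normalize(derja), normalize(english)
        if not derja or not english:
            continue  # skip rows missing either side of the translation
        key = (derja, english.lower())  # treat case-only differences in English as duplicates
        if key in seen:
            continue
        seen.add(key)
        cleaned.append((derja, english))
    return cleaned

# Illustrative rows only (hypothetical transliteration, not from the real dataset).
raw = [
    ("  3andi   ejtime3 ghodwa  ", "I have a meeting tomorrow."),
    ("3andi ejtime3 ghodwa", "i have a meeting tomorrow."),
    ("", "Orphan translation"),
]
print(clean_pairs(raw))
```

Only the first row survives here: the second is a duplicate once whitespace and casing are normalized, and the third has no Derja side.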


Exploring Platforms: From Hugging Face to OpenAI

Once I had my dataset, the next challenge was choosing the right platform to train the model. I explored several options, starting with Hugging Face, which offers great flexibility for fine-tuning models and managing large-scale datasets. Hugging Face provided some valuable tools, but I realized that training a translation model for a chat-based experience—one that could handle contextual, conversational translations—would require a more specialized environment.

This led me to experiment with different models and approaches, testing GPT-Neo on Hugging Face, and attempting custom fine-tuning with OpenAI’s API. Each platform had its strengths, but my goal was to find one that could handle Tunisian Derja's conversational nature, understand idiomatic expressions, and produce accurate and context-aware translations.

Finally, I found the perfect solution: GPT-4o-mini from OpenAI. It was powerful yet lightweight, making it ideal for conversational translations. OpenAI also provided an intuitive fine-tuning process, which allowed me to easily upload my dataset, configure hyperparameters, and monitor the training progress.
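For readers who prefer code to a dashboard, the upload, configure, and monitor flow can also be driven from the OpenAI Python SDK. The sketch below is a rough outline under assumptions: the helper names are hypothetical, the model snapshot name and the epoch count are illustrative values rather than the settings used for TunDerja, and `client` is assumed to be an `openai.OpenAI` instance created elsewhere.

```python
def build_job_request(training_file_id: str, n_epochs: int = 3,
                      model: str = "gpt-4o-mini-2024-07-18") -> dict:
    """Assemble keyword arguments for a fine-tuning job.

    The default model snapshot and epoch count are illustrative assumptions.
    """
    return {
        "training_file": training_file_id,
        "model": model,
        "hyperparameters": {"n_epochs": n_epochs},
    }

def start_fine_tune(client, jsonl_path: str, **kwargs) -> str:
    """Upload a JSONL dataset and start a fine-tuning job; returns the job id.

    `client` is assumed to be an openai.OpenAI instance (not created here).
    """
    with open(jsonl_path, "rb") as fh:
        uploaded = client.files.create(file=fh, purpose="fine-tune")
    request = build_job_request(uploaded.id, **kwargs)
    job = client.fine_tuning.jobs.create(**request)
    # Progress can be checked later with client.fine_tuning.jobs.retrieve(job.id)
    return job.id
```

Keeping the request assembly in its own function makes it easy to log or review the exact settings of each run before anything is sent to the API.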


Fine-tuning the model using OpenAI

Challenges and Lessons: Fine-Tuning for a Chat-Based Model

Once I settled on GPT-4o-mini, the real work began. Fine-tuning a model isn’t just about feeding it data—it’s about making sure the data is structured in a way that the model can effectively learn from. Since GPT-4o-mini is a chat-based model, I had to format my dataset into a conversational structure, simulating real interactions between a user asking for a translation and the AI providing the answer.

This meant creating a dataset where each conversation looked something like this:

{
  "messages": [
    {"role": "system", "content": "You are a translator from Tunisian Derja to English."},
    {"role": "user", "content": "عندي اجتماع غدوة الصباح"},
    {"role": "assistant", "content": "I have a meeting tomorrow morning."}
  ]
}
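A record like the one above can be generated from plain phrase pairs with a few lines of Python. This is a minimal sketch, assuming the pairs are already available as (Derja, English) tuples. The helper names are hypothetical, and the Derja side in the sample is a placeholder string. Fine-tuning files for chat models are uploaded as JSON Lines, meaning one complete JSON object per line.

```python
import json

SYSTEM_PROMPT = "You are a translator from Tunisian Derja to English."

def to_chat_record(derja: str, english: str) -> dict:
    """Wrap one phrase pair in the system/user/assistant chat structure."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": derja},
            {"role": "assistant", "content": english},
        ]
    }

def to_jsonl(pairs) -> str:
    """Serialize pairs as JSON Lines: one JSON object per line."""
    # ensure_ascii=False keeps Arabic script readable instead of \uXXXX escapes
    return "\n".join(
        json.dumps(to_chat_record(derja, english), ensure_ascii=False)
        for derja, english in pairs
    )

# "<Derja sentence>" is a placeholder, not real data.
pairs = [("<Derja sentence>", "I have a meeting tomorrow morning.")]
jsonl_text = to_jsonl(pairs)
print(jsonl_text)
```

Writing `jsonl_text` to a UTF-8 encoded file gives a training file in the expected shape.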

After structuring the dataset, I encountered several challenges—errors in file formats, misconfigurations in the fine-tuning settings, and the inevitable trial and error of tweaking learning rates, batch sizes, and epochs. But each obstacle provided valuable insights into how AI models interpret language, especially when dealing with a rich and complex dialect like Tunisian Derja.
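Many file-format errors of this kind can be caught locally before uploading anything. The following is a small sketch of such a pre-flight check, not an official validator. The function name and the specific rules (known roles, non-empty content, assistant message last) are my own assumptions about what is worth checking.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_jsonl(text: str) -> list[str]:
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            problems.append(f"line {lineno}: blank line")
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append(f"line {lineno}: missing 'messages' list")
            continue
        for msg in messages:
            if not isinstance(msg, dict) or msg.get("role") not in ALLOWED_ROLES:
                problems.append(f"line {lineno}: unexpected role")
            elif not isinstance(msg.get("content"), str) or not msg["content"].strip():
                problems.append(f"line {lineno}: empty content")
        last = messages[-1]
        if isinstance(last, dict) and last.get("role") != "assistant":
            problems.append(f"line {lineno}: last message is not from the assistant")
    return problems
```

Running this over the training file and fixing every reported line is cheaper than discovering the same problems through a failed fine-tuning job.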


Success! Building a Tunisian Derja AI Translator

Finally, after multiple iterations and improvements, I had a working Tunisian Derja to English AI translator. Seeing the model accurately translate an everyday Derja expression into "Grab the rope well and don't let go" was incredibly satisfying. It was a testament to the power of AI and how it can be adapted to preserve and enhance local languages and dialects.


TunDerja translating a common phrase used in daily Tunisian communication

https://ellemediaempire.com/tunisian-derja/

Through this journey, I’ve learned that fine-tuning a model isn’t just about the technical work—it’s about understanding the language and culture you’re trying to model. It’s about realizing that AI has the potential to not only solve problems but also preserve the beauty and diversity of languages that make our world so rich.


The Future: Expanding AI for Localized Languages

This is just the beginning. My goal is to continue refining the model, expanding the dataset, and making the translator even more accurate. I envision a future where AI models like this can be used by Tunisians around the world, allowing them to seamlessly communicate in both Tunisian Derja and English while preserving their cultural identity.

If you’re a developer, linguist, or simply someone interested in AI, I encourage you to explore how AI can be used to solve localized problems. The world of AI is vast, and there’s room for every language, every dialect, and every unique voice.


About the Author

I’m Abir Chermiti, a software engineer, AI enthusiast, and business strategist from Tunisia. With over a decade of experience in AI and project management, I’m passionate about using technology to drive innovation and empower communities. Through projects like this AI translator for Tunisian Derja, I aim to bridge gaps, preserve cultures, and unlock new opportunities for people around the world.

Check it out here: https://ellemediaempire.com/tunisian-derja/


TunDerja created by Abir Chermiti

