Building a Sheng Translation Model with Llama 3

A Journey into African Language Processing

Introduction

As we strive to develop custom technology solutions for Africa's unique challenges, it's important to note that language plays a big role in making these innovations resonate with local communities. In this article, I walk through the process of training a translation model that converts English text into Sheng—a vibrant and evolving dialect spoken primarily by the youth in Nairobi, Kenya.

Sheng is a dynamic blend of Swahili, English, and other local languages, originating from the urban youth culture in Nairobi. While it is not an official or national language, Sheng is widely spoken and deeply embedded in the everyday lives of many young Kenyans, who often find it more relatable and less formal than English or Swahili. This makes Sheng an essential medium for informal communication, whether in text messages, conversations with friends, or social media interactions.

Recognizing the importance of Sheng in the daily lives of a significant portion of Kenya's population, I set out to create a model that could accurately translate English into Sheng. The main challenge in this undertaking was the availability of training data. Since Sheng is not officially recognized, there are limited written resources that capture its fluid and ever-changing nature.

The goal of this project is not just to bridge the language gap but to integrate AI more deeply into the fabric of African culture. By developing a tool that speaks the language of the youth, we can create solutions that are not only technologically advanced but also culturally relevant. This project represents a step towards making AI more accessible and meaningful to the people who will use it most.


Section 1: Understanding the Problem

Sheng, often described as an acronym for “Swahili-English slang” (Mazrui, 1995), emerged in the 1960s within Nairobi's rich multicultural tapestry. This urban language combines elements of Kiswahili and English with influences from various Kenyan languages, including Kikuyu, Luhya, Dholuo, and Kikamba. Sheng is marked by its linguistic fluidity and adaptability, constantly evolving with new words and phrases as the social context changes. Despite its widespread use, especially among the youth, Sheng does not have an official status, making it difficult to formalize or standardize (Ferrari, 2014).

The linguistic diversity and flexibility that make Sheng unique also present significant challenges when developing AI solutions. Unlike dominant languages such as English or Swahili, Sheng lacks a comprehensive, structured dataset that can be readily used for training AI models. Most available resources are monolingual, primarily in English or Swahili, with Sheng often being used informally in conversations, social media, or specific cultural contexts.

One of the major hurdles in developing AI for underrepresented languages like Sheng is the scarcity of data. Data is the lifeblood of AI; without sufficient, high-quality data, it is nearly impossible to train models that can perform complex tasks such as translation. In the case of Sheng, the challenge is compounded by its informal and evolving nature, making it difficult to capture in a static dataset.

This lack of available data highlights a broader issue within the field of AI: the underrepresentation of non-dominant languages.

Addressing this requires innovative approaches to data collection, model design, and model training, ensuring that AI can serve diverse linguistic communities.

Now, let's get into it as I show how I approached this challenge and the steps I took to build a translation model for Sheng.


Section 2: Data Processing

Data Source Discovery

After an extensive search for any available datasets to train a Sheng translation model, I shifted my focus from typical data sources to more specialized sites that had attempted to translate English into Sheng. This led me to discover Shengilia, a blog dedicated to translating books of the Bible—the Gospels of Matthew, Mark, Luke, and John, plus Acts—into Sheng. The blog presented a unique opportunity: it contained the Bible verses in English paired with their direct Sheng equivalents.

Initially, I was concerned that the dataset, being heavily focused on biblical texts, might introduce a bias in the AI model, limiting its applicability to religious contexts. However, during evaluation, I found that this was not the case. On the contrary, the richness of the biblical language, with its complex and abstract ideas, provided a solid foundation for the model. This allowed the model to develop a deeper understanding of linguistic connections in Sheng, enabling it to express complex philosophical ideas even beyond religious contexts.

Data Scraping

To collect the data, I used Python's Beautiful Soup package to scrape the English and Sheng texts from Shengilia. This process involved recursively navigating through over 50 pages on the blog and extracting all relevant text data. The result was a dataset of over 3,000 English-to-Sheng translation pairs, which I saved in a CSV file for further processing.
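For illustration, a minimal sketch of this scraping step might look like the following. The URL pattern and CSS selectors shown here are placeholders rather than Shengilia's actual markup, and would need to be adapted to the blog's real structure:

```python
# Minimal scraping sketch (illustrative only): the URL pattern and CSS classes
# below are placeholders, not the blog's actual markup.
import csv
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://shengilia.example.com/page/{}"  # hypothetical URL pattern
pairs = []

for page in range(1, 51):  # the blog spanned 50+ pages
    resp = requests.get(BASE_URL.format(page), timeout=30)
    if resp.status_code != 200:
        break
    soup = BeautifulSoup(resp.text, "html.parser")
    # Assume each verse block holds an English span and its Sheng equivalent.
    for verse in soup.select("div.verse"):            # placeholder selector
        english = verse.select_one("span.english")    # placeholder selector
        sheng = verse.select_one("span.sheng")        # placeholder selector
        if english and sheng:
            pairs.append((english.get_text(strip=True), sheng.get_text(strip=True)))

with open("english_sheng_pairs.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["english", "sheng"])
    writer.writerows(pairs)

print(f"Saved {len(pairs)} translation pairs")
```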

Data Cleaning

The initial scraping process included some basic cleaning steps, such as removing Unicode characters and blank lines. These steps were executed within the data extraction notebook. However, further processing was required to refine the dataset for training. This included manually verifying the accuracy of the translation pairs, ensuring that no records were corrupted during scraping, and removing special characters and verse numbers that did not contribute to the translation task.
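A sketch of those cleaning steps, assuming the column and file names from the scraping sketch above:

```python
# Cleaning sketch for the scraped pairs; column and file names follow the
# scraping sketch above ("english", "sheng").
import re
import pandas as pd

df = pd.read_csv("english_sheng_pairs.csv")

def clean(text: str) -> str:
    text = re.sub(r"^\s*\d+(?::\d+)?[.)]?\s*", "", text)  # drop leading verse numbers like "3:16"
    text = re.sub(r"[^\w\s.,;:!?'’-]", "", text)          # strip stray special characters
    return re.sub(r"\s+", " ", text).strip()              # collapse runs of whitespace

for col in ("english", "sheng"):
    df[col] = df[col].astype(str).map(clean)

# Drop rows that ended up empty or were corrupted during scraping.
df = df[(df["english"].str.len() > 0) & (df["sheng"].str.len() > 0)]
df = df.drop_duplicates().reset_index(drop=True)

df.to_csv("english_sheng_clean.csv", index=False)
```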

Data Significance

The final dataset consisted of 3,000 English-to-Sheng translation pairs. To evaluate the impact of dataset size on model performance, I split the dataset into smaller subsets—1,000 and 3,000 records, respectively—and conducted training sessions with both. The dataset’s complexity, derived from the philosophical and abstract nature of biblical texts, allowed the model to translate not just literal meanings but also deeper, more abstract thoughts into Sheng. This makes the dataset particularly valuable for developing AI models that can handle complex linguistic tasks in underrepresented languages.
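The split itself is simple; a sketch, assuming the file and column names from the cleaning step above:

```python
# Create a 1,000-record subset alongside the full 3,000-record set for the
# dataset-size comparison (file names follow the sketches above).
import pandas as pd

df = pd.read_csv("english_sheng_clean.csv")
subset_1k = df.sample(n=1000, random_state=42)  # fixed seed for reproducibility

subset_1k.to_csv("english_sheng_1k.csv", index=False)
df.to_csv("english_sheng_3k.csv", index=False)
```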

Section 3: Model Development

Model Choice

After preparing the data, the next step was to select the right model for the task. I decided to fine-tune LLaMA 3 8B (8 billion parameters) for this project. Fine-tuning a large language model (LLM) seemed like a more practical and efficient option compared to building a custom model from scratch, especially given the time constraints and the limited sample size available. Fine-tuning techniques can significantly enhance the performance of existing LLMs on specific downstream tasks (Tian et al., 2023).

I chose LLaMA 3 for several reasons. First, LLaMA 3 is well-suited for developer fine-tuning, with Meta providing strong support and documentation. Additionally, LLaMA 3 has an active community that has developed standard training algorithms, which simplifies the process of fine-tuning and reduces the time needed to create effective training protocols.

For this task, I employed the supervised fine-tuning technique, adjusting 10% of the model's parameters. This approach allowed the model to improve at the new task—translating English to Sheng—while retaining much of its original knowledge, particularly in language structure and nuance (Xu et al., 2024).
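One way to approximate this setup is to freeze the base weights and unfreeze only the top transformer blocks so that roughly a tenth of the parameters remain trainable. The sketch below follows that idea; the model ID, prompt template, and hyperparameters are illustrative assumptions, not the exact configuration used:

```python
# Sketch of supervised fine-tuning with most weights frozen. Model ID, prompt
# template, and hyperparameters are illustrative, not the exact setup.
import pandas as pd
import torch
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_ID = "meta-llama/Meta-Llama-3-8B"  # gated model; requires Hugging Face access
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

# Freeze everything, then unfreeze the last few transformer blocks so that
# roughly 10% of the parameters remain trainable.
for param in model.parameters():
    param.requires_grad = False
for layer in model.model.layers[-3:]:
    for param in layer.parameters():
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable fraction: {trainable / total:.1%}")

# Turn each pair into a single prompt/completion training string.
df = pd.read_csv("english_sheng_clean.csv")
texts = [f"Translate to Sheng:\n{en}\nSheng: {sh}{tokenizer.eos_token}"
         for en, sh in zip(df["english"], df["sheng"])]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

ds = Dataset.from_dict({"text": texts}).map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="sheng-llama3",
        num_train_epochs=60,  # matches the training length described in the next section
        per_device_train_batch_size=2,
        learning_rate=2e-5,
        logging_steps=50,
    ),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("sheng-llama3")
tokenizer.save_pretrained("sheng-llama3")
```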

Training and Evaluation

The model was trained for 60 epochs, a process that allowed it to gradually learn the translation nuances between English and Sheng. After training, I evaluated the model's performance using various test cases, and the results were promising. Below are some screenshots of how the model performed on different words and phrases.

[Screenshots: sample English-to-Sheng outputs from the fine-tuned model]
The model demonstrates a strong ability to understand and mix English and Swahili in its outputs, even when encountering words that were not in the original training vocabulary. This suggests that the fine-tuned LLaMA 3 model effectively captures the linguistic flexibility needed to handle Sheng, which often blends elements from multiple languages.
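To give a sense of how the model is queried, here is a minimal inference sketch; it assumes the prompt format and output directory from the fine-tuning sketch above:

```python
# Querying the fine-tuned checkpoint (prompt format follows the training sketch above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CKPT = "sheng-llama3"  # directory written by the Trainer above
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForCausalLM.from_pretrained(CKPT, torch_dtype=torch.bfloat16)
model.eval()

prompt = "Translate to Sheng:\nHow are you doing today, my friend?\nSheng:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=60, do_sample=False)

# Print only the newly generated tokens, i.e. the Sheng translation.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```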

Section 4: Challenges and Learnings

Technical Hurdles

While the journey to developing this Sheng translation model was rewarding, it was not without its challenges. One of the most significant technical hurdles was managing the tokenization of mixed-language texts. Sheng, by nature, blends elements from multiple languages, including Swahili, English, and various local dialects. This complexity made it challenging to maintain consistent tokenization, especially with words that didn’t fit neatly into the token vocabularies of existing models.
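A quick way to see this problem is to inspect how the base tokenizer splits common Sheng words into subword pieces; the example words below are illustrative:

```python
# Inspect how the base tokenizer fragments Sheng words (example words are illustrative).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

for word in ["hello", "habari", "niaje", "mtaa", "vipi msee"]:
    pieces = tokenizer.tokenize(word)
    print(f"{word!r:>14} -> {len(pieces)} tokens: {pieces}")
```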

Another challenge was the relatively small size of the dataset. Although I managed to scrape over 3,000 records, the dataset's limited size meant that the model could be prone to overfitting, particularly given the highly specific nature of the biblical content. Balancing between retaining enough complexity to capture the nuances of Sheng and ensuring that the model generalized well to other contexts was a delicate task.

Learning Points

Through this project, I gained several key insights that will inform my future work in NLP and AI:

1. Importance of Data Quality: The quality of the data is paramount, especially when dealing with low-resource languages like Sheng. Even small inconsistencies or errors in the dataset can significantly impact the model's performance, highlighting the need for meticulous data preprocessing.

2. Tokenization Strategies: Handling mixed-language data requires careful consideration of tokenization strategies. For this project, I had to experiment with different approaches to ensure that the model could effectively process and understand the blend of languages that define Sheng.

3. Handling Code-Switching: Code-switching, a common feature of Sheng, where speakers switch between languages mid-sentence, posed a unique challenge. This project taught me the importance of designing models that can adapt to such linguistic phenomena, which are common in many African languages.

Section 5: Future Work

Next Steps

Looking ahead, my immediate focus is on training the model using the full 3,000-record dataset. This will provide a more comprehensive understanding of how the model performs across a broader range of linguistic constructs and contexts. Additionally, I plan to experiment with different fine-tuning techniques, such as transfer learning from other multilingual models, to further enhance the model's capabilities.

I’m also considering expanding the dataset by incorporating more diverse sources of Sheng, including modern media, literature, and social media platforms. This could help the model better grasp the evolving nature of Sheng and improve its performance in real-world applications.

Potential Applications

The possibilities for applying this Sheng translation model are vast. In education, it could be used as a tool to help students learn in a language they are more comfortable with, potentially increasing engagement and comprehension. In the realm of translation services, it could bridge the gap between English and Sheng, making information more accessible to Sheng speakers.

Beyond translation, this model has the potential to contribute to the preservation of African languages. By creating tools that can understand and generate Sheng, we can help ensure that this vibrant, living language continues to thrive in the digital age.

Conclusion

As we continue to explore the possibilities of NLP in underrepresented languages like Sheng, I encourage other AI engineers, linguists, and enthusiasts to contribute to this growing field. There's so much potential for innovation and impact, and I welcome any thoughts, feedback, or collaboration ideas.

Acknowledgements

I would like to extend my gratitude to the open-source community, particularly the developers behind LLaMA 3, for providing the tools and resources that made this project possible. I also want to thank the creators of Shengilia for their work in translating the Bible into Sheng, which served as the foundation for this project.

Links

If you’re interested in exploring the Sheng translation model further, you can find it on Hugging Face [https://huggingface.co/EgesaWO].

For more insights and updates, feel free to check out my related blog posts and articles.


References

Tian, K., Mitchell, E., Yao, H., Manning, C. D., & Finn, C. (2023). Fine-tuning Language Models for Factuality.

Xu, H., Zhan, R., Wong, D. F., & Chao, L. S. (2024). Let’s Focus on Neuron: Neuron-Level Supervised Fine-tuning for Large Language Model.

Ferrari, A. (2014). Evolution of Sheng during the Last Decade. Les Cahiers d'Afrique de l'Est, 49, 29–54. https://doi.org/10.4000/eastafrica.340

