Cynthia Jayne Amol: Curating datasets for low-resource Kenyan languages

In a nutshell, what do you do for work?

As a systems support specialist, I maintain the institution’s Enterprise Resource Planning (ERP) software, databases and other systems while offering technical assistance to system users.

As a technical research assistant for the African Next Voices project through KenCorpus, I am in charge of data curation for translation into five Kenyan languages: Dholuo, Gikuyu, Maasai, Somali and Kalenjin. The project is a pilot data collection initiative targeting select underrepresented languages in Kenya, that is, languages that lack extensive data and models.

Currently, I am consolidating relevant open-source datasets and liaising with the owners of proprietary data to gain access to it. The speech and textual data are then processed and translated by teams in the local communities organized by Maseno, USIU, Kabarak and Dedan Kimathi universities (Kenya) and LDRI. The data, once translated into these local languages, can be used to train machine learning models for diverse tasks including Machine Translation (MT), Automatic Speech Recognition (ASR), Text-to-Speech (TTS) and Speech-to-Text (STT). These models will help the community benefit from automation driven by artificial intelligence (AI), bringing AI closer to home by designing models around real community use cases. Some of the key sectors being explored for use case implementation include agriculture, healthcare and education.


How did you get into the natural language processing space?

While exploring thesis options for my master's studies, I was introduced to natural language processing (NLP) by my thesis supervisor and mentor, Lilian D. A. Wanzare, PhD. As I started researching potential topics, I began to realize how impactful NLP for low-resource languages was. The limited inclusion of African languages within the NLP space was a great motivation to enter the field. Most African indigenous languages are classified as low-resource due to the limited availability of quality datasets and models. Without datasets in local languages, we fail to preserve our languages for future generations. Further, without these resources, AI-driven automation models cannot be created, limiting the advancement of AI for our communities.


What does your day-to-day work involve?

Like any corporate champion, I start my day answering emails. I offer technical support to virtual clients and troubleshoot any system issues before opening the doors to physical clients. On some days, we conduct user trainings to update users on fixed bugs or collect additional system requirements.

After hours, I work with research teams on data collection and processing. Textual and speech data are primarily collected from the community, though some are translated from existing open source datasets. I am also involved in some language modelling projects - creating models that are able to classify text. During the weekends, I mentor students working on ML/NLP projects.


Which aspects of your work do you enjoy the most?

Working in systems support, I interact with clients on a daily basis, some of whom are not tech-savvy. I have learnt the art of patience and being able to guide with care. I enjoy guiding users to a point of knowledge. The realization of how easy a task is when the user successfully executes it brings me joy.

In language modelling, creating resources that are useful in language preservation, misinformation detection and solving real societal problems, such as agricultural or health support using ASR and TTS models, makes the hard work enjoyable. Realizing that the work we do now may impact future generations of NLP researchers drives us to scale projects as much as possible.


What are some memorable projects you have worked on?

Creating a Swahili-English code-switched political misinformation classification dataset and model for my master's thesis was quite interesting. Aside from the hectic but engaging data annotation sessions that I was part of (labelling text into predefined classes to teach a model to perform classification tasks), the project had an impactful outcome which I was proud of.

The African Next Voices project that I am working on at the moment has also presented me with several challenges which foster growth. When I took up the project, I was not aware of how big it was going to get, and the fact that it takes me out of my comfort zone makes every milestone quite memorable.


How are you creating impact through your work?

In addition to being involved in projects that create real solutions to local societal problems, I believe that I am laying the groundwork for future researchers to venture into the development of technologies for low-resource languages. There has been a lot of talk about how researchers in Africa ought to continuously be involved in the development of African solutions to African problems. My work is centered on this.


What are your current research interests in ML/NLP?

Previously, I predominantly worked with textual data. I am currently exploring speech data and the development of solutions such as ASR, TTS and STT for local languages in Kenya. Preservation of speech data, especially for low-resource African languages, is essential.


How long do you see yourself in the NLP/ML space?

Honestly, for as long as possible. I believe that the current research in the NLP/ML space, especially in the African context, has barely scratched the surface. There is much more to be done, and I look forward to being part of the movement for as long as I can.


Podcast recommendation


Methali

35. Ihsani (hisani) haiozi. Kindness does not go rotten.


Quote

“It is better to die for an idea that will live, than to live for an idea that will die.” Steve Biko
