An AI Researcher's ongoing effort to improve the Rossa process

Meeting diarization is one of the focuses of the Rossa Voice team. It aims to answer “who said what” in a meeting, and it has applications ranging from everyday conversations to call centers.

Within meeting diarization, speaker diarization is one of the most important tasks: determining who is speaking at each point in the conversation. Technically, it splits an audio recording into smaller segments, each containing the voice of only one speaker, and then identifies the speaker of each segment. When the speakers are unknown, speaker diarization becomes the task of clustering speech segments into groups, where each group corresponds to one speaker.

Essentially, speaker diarization builds a feature extraction model, called a speaker embedding model, to extract voice characteristics from a given audio segment. It then groups the audio segments based on the similarity of the extracted features.
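The grouping step can be illustrated with a minimal sketch. It assumes the embeddings have already been extracted by some speaker embedding model (the toy 3-dimensional vectors below are made up), and it uses a simple greedy clustering with cosine similarity and a hypothetical threshold of 0.8. This is only an illustration of the idea, not the method used in Facilun.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cluster_segments(embeddings, threshold=0.8):
    """Greedy clustering of segment embeddings (illustrative only).

    Each segment joins the most similar existing cluster if the similarity
    to that cluster's centroid reaches `threshold`; otherwise it opens a
    new cluster. Returns one cluster index (speaker label) per segment.
    """
    centroids = []  # running mean embedding of each cluster
    counts = []     # number of segments in each cluster
    labels = []
    for emb in embeddings:
        best, best_sim = None, threshold
        for idx, centroid in enumerate(centroids):
            sim = cosine_similarity(emb, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            centroids.append(list(emb))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            n = counts[best]
            # Update the centroid as the running mean of its members.
            centroids[best] = [(c * n + e) / (n + 1)
                               for c, e in zip(centroids[best], emb)]
            counts[best] = n + 1
            labels.append(best)
    return labels


# Toy embeddings for four audio segments (assumed values, two speakers).
segments = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.8, 0.2, 0.1],
    [0.0, 1.0, 0.2],
]
labels = cluster_segments(segments)
print(labels)  # [0, 1, 0, 1]
```

In practice the embeddings have hundreds of dimensions and the clustering method and threshold are tuned on real recordings; the greedy scheme here is chosen only because it is short enough to read at a glance.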

The Rossa Voice team started its meeting diarization product, Facilun, from scratch, with no prior research on speaker recognition. Understanding the team's difficulties in distinguishing voices, Aiden decided to research the problem and put effort into this process. He faced many challenges at the beginning: he did not know which model would be most effective or which dataset could be used. Through many experiments, failing, trying again, and… failing again, Aiden gradually mastered the technologies. He came to understand which models are suitable, when they fail, and why they fail. It is still hard to say that we have a perfect speaker diarization model, but some good baseline models have been built. We are approaching application level, where the accuracy requirement is over 90%, meaning that error rates stay below 10% under normal recording conditions.

Separating the speakers makes the speech-to-text output more readable. It also supports many downstream tasks that process the ASR output. Punctuation, which segments the ASR output into sentences, becomes more accurate. Information extraction from the ASR output is easier because we know whose important sentences should be extracted. Meeting content can be classified better because we know whether it comes from the call center's operator or from the customer, and so on.
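As a rough illustration of how speaker labels make ASR output more readable, the sketch below attaches each recognized word to the speaker turn it overlaps most and then merges consecutive words from the same speaker into one line. The data layout (tuples of start time, end time, and text or speaker) and the function names are assumptions made for this example, not the format used by Rossa's modules.

```python
def label_words(words, turns):
    """Assign each ASR word to the speaker turn it overlaps most.

    words: list of (start, end, text) tuples from the ASR output.
    turns: list of (start, end, speaker) tuples from diarization.
    A word that overlaps no turn gets the speaker None.
    """
    labeled = []
    for w_start, w_end, text in words:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            overlap = min(w_end, t_end) - max(w_start, t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((best_speaker, text))
    return labeled


def render_transcript(labeled):
    """Merge consecutive words from the same speaker into one line each."""
    lines = []  # list of (speaker, [tokens])
    for speaker, text in labeled:
        if lines and lines[-1][0] == speaker:
            lines[-1][1].append(text)
        else:
            lines.append((speaker, [text]))
    return [f"{speaker}: {' '.join(tokens)}" for speaker, tokens in lines]


# Hypothetical call-center snippet: times in seconds.
turns = [(0.0, 2.0, "operator"), (2.0, 4.5, "customer")]
words = [
    (0.1, 0.5, "hello"), (0.6, 1.2, "how"), (1.3, 1.9, "can i help"),
    (2.2, 2.8, "my"), (2.9, 3.5, "order"), (3.6, 4.3, "is late"),
]
transcript = render_transcript(label_words(words, turns))
for line in transcript:
    print(line)
```

With the speaker attached to every line, later steps such as punctuation or information extraction can work on one speaker's text at a time.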

Almost all members of Rossa are involved in developing the diarization modules. Some of the core features are developed in collaboration with the ASR module, and some NLP techniques are applied. Diarization has improved over time by adopting techniques similar to those used in NLP and ASR, thanks to collaboration among the AI Researchers across Rossa's projects. One lesson learned from this collaboration is the value of listening and learning to compromise. It is essential to listen closely to each team member's ideas, feedback, and advice, and to respond in a considerate and respectful manner. Reaching a compromise is the best way to reconcile different perspectives.

Although speaker diarization has achieved some results and improved various tasks, its performance is still limited, especially when the recording environment is poor. It needs further improvement to satisfy the requirements of such applications. In addition, speaker diarization has little meaning when used alone. Much more work on adaptation and collaboration with other speech processing modules will be needed to make it efficient in application and bring true value to Rossa's clients.


