Accent on Innovation: AI Meets the Welsh

In an age where marketing is driven by personalised content, I was most interested in investigating how voice cloning technology could be used to scale my clients' sales and marketing efforts. Voice cloning has come on leaps and bounds in recent times and is emerging as a viable tool for creating engaging and relatable media at scale.

I found the tech to be great - lots of platforms out there - it reproduced my own voice very well (and improved my 'Estuary/Multicultural-London-English' accent in the process, making me sound altogether very well-spoken)!

However, a specific need from a client presented an unforeseen challenge that revealed quite a big gap in these new technologies...

Although I was personally quite happy to hear my accent smoothed out to a more 'standard' English, my client is a proud Welsh woman. Her accent is fundamentally important - both to her identity and, from a practical perspective, to maintaining authenticity in any marketing we want to do using this technology. The initial tests with available voice cloning systems, while successful in creating realistic voice clones in terms of tone, failed entirely to authentically replicate the Welsh accent. This is more than a technical shortfall; it is a barrier to realising authenticity and cultural resonance.

This was intriguing, and so began another evening lost to AI....


The Challenge


I found myself genuinely curious. What made the Welsh accent so tough to clone? I knew nothing about linguistics or voice cloning technology....but that's never stopped me from wasting my time before!

The issues at play, in summary, were:

  • Uniqueness of the Welsh Accent: The Welsh accent, like all regional accents, has its own unique intonations, stresses, and phonetic variations that differentiate it from English accents.
  • Implications for Voice Cloning: Voice cloning systems often rely on phonetic information to generate accurate and natural-sounding speech. Without a dictionary that captures the nuances of the Welsh accent, creating a voice clone that authentically reproduces this accent becomes challenging.
  • Difficulty in Training: For machine learning models, especially in speech and voice, data is king. Without a phonetic translator or dictionary, curating a dataset that explicitly captures the phonetic nuances of the Welsh accent is arduous.
  • Increased Ambiguity: Without a specific phonetic guide, there are ambiguities in how different words or sounds are represented, leading to inconsistencies in training and potentially in the voice clone's output.
  • Generalised Models Don't Suffice: Models trained on general English datasets do not capture the specificities of the Welsh accent. This leads to the voice clone sounding 'off' or inauthentic when attempting the Welsh accent.


Approach - Part 1 - Find a Dictionary:


My immediate thought was to find a linguistic tool that could bridge this gap. A phonetic dictionary specifically for the Welsh accent seemed like the logical solution. Such a resource could provide a system with a clear set of rules to mimic the accent's unique sounds.

However, I quickly discovered that no such dedicated Welsh-accent-to-English-accent dictionary exists.

So, I moved on to deeper exploration...



Approach - Part 2 - Write a Dictionary:


This absence of a specialised phonetic dictionary led me to consider the possibility of creating my own (naivety or hubris?). I thought perhaps I could create one programmatically based on the guiding principles of the accent. My plan was:

  • Analyse Research on the Welsh Accent: Utilise a Large Language Model (LLM) to analyse a large volume of existing linguistic research papers on the Welsh accent. The idea was to extract the underlying principles or 'rules' that define the accent's unique characteristics, such as its intonation, stress patterns, and vowel and consonant variations. This analysis could then provide a structured understanding of the Welsh accent, breaking it down into identifiable, rule-based components.
  • Programmatic Implementation Against a Standard Phonetic Dictionary: Without the time for manual intervention, the idea was to apply the principles programmatically to a standard English phonetic dictionary. I would modify this dictionary to reflect the specific phonetic characteristics of the Welsh accent. By doing so, I aimed to create a 'Welsh accent version' of the phonetic dictionary we were missing, which could then potentially be used by voice cloning systems to produce a more authentic Welsh accent.
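To make the second step concrete, here is a toy sketch of what 'applying accent rules programmatically to a standard phonetic dictionary' might look like, using a CMUdict-style entry. The substitution rule below is a made-up placeholder for illustration only, not real Welsh-accent phonology:

```python
# Toy sketch: applying a phoneme-substitution "accent rule" to a
# CMUdict-style dictionary entry. The rule below is an illustrative
# placeholder - it is NOT a real feature of Welsh-accented English.
ACCENT_RULES = {"ER": "EH R"}  # hypothetical substitution rule

def apply_rules(pronunciation, rules):
    """Rewrite a space-separated phoneme string using the rule table."""
    out = []
    for phone in pronunciation.split():
        base = phone.rstrip("012")          # strip CMUdict stress markers
        out.append(rules.get(base, phone))  # substitute if a rule matches
    return " ".join(out)

entry = ("BUTTER", "B AH1 T ER0")           # word, CMUdict pronunciation
accented = (entry[0], apply_rules(entry[1], ACCENT_RULES))
# accented is now ("BUTTER", "B AH1 T EH R")
```

The mechanics of substitution are trivial; the genuinely hard part, as became clear, is deriving rules that are actually faithful to the accent in the first place.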


It very quickly became clear this was unviable....


  • Extracting Rules from Research: While LLMs are powerful in analysing large texts, distilling concise and accurate phonetic rules from academic research still requires careful interpretation and an understanding of both linguistics and Welsh accent characteristics.
  • Implementing Rules Programmatically: The challenge here was not just in the programming but in ensuring that these rules could be applied in a way that accurately captures the nuances of the Welsh accent. This would have needed complex algorithmic work, where even minor deviations in rule application could lead to significant inaccuracies in accent replication.
  • Bridging the Gap Between Theory and Application: The transition from theoretical phonetic rules to a practical, usable phonetic dictionary would have needed a complex melding of linguistic theory and technical implementation. It required a balance of linguistic insight and technical skill to create a tool that was both accurate in its representation of the Welsh accent and functional for use in voice cloning.

Is it possible? Maybe. But it was far too much time and effort to even scope, given I was just dabbling.


I certainly very quickly came to appreciate the intricate relationship between the theoretical aspects of linguistics and the practical challenges of programming and AI technology.



Approach - Part 3 - Voice Analysis:


That idea was scuppered, and my next thought was to see if we could extract measurements from voice analysis which could then be used to form principles or rules for creating the Welsh accent clones. This necessitated a deep dive into voice analysis using ChatGPT's Python abilities.


First, Data:

Kaggle is more than just a database; it's a community-driven platform where data scientists, researchers, and enthusiasts share a plethora of datasets ranging from user-generated content to professionally curated collections. For anyone delving into a data-driven project, Kaggle is useful.

On Kaggle, I found a dataset labelled 'libritts-welsh' which turned out to be a goldmine for my project. It comprised hundreds of audio samples featuring a wide range of authentic Welsh accents. This diversity was excellent – having a broad spectrum of accents meant we could potentially capture the nuances and variations inherent in Welsh speech.

Without Kaggle, sourcing such a specialised dataset would have been a far more daunting task. Instead of spending considerable time and resources collecting recordings, I could immediately focus on preprocessing and analysing the data, moving the project forward more efficiently.


Utilising ChatGPT's Python Abilities for Initial Analysis:

The foundational step was to ensure the integrity of the Welsh voice data sourced from Kaggle. Using the Python libraries noisereduce for noise reduction, and pydub and librosa for volume normalisation and silence trimming, I preprocessed the audio to remove noise and inconsistencies. This ensured a clean dataset, meaning that any features extracted would be representative of the accent itself, not marred by recording artifacts.

Next, I extracted a range of acoustic features from the Welsh voice samples, applying librosa and praat-py to pull out fundamental features such as pitch, intensity, duration, and timbre, and to quantify more complex speech attributes like formant frequencies and temporal dynamics.

Beyond these basic parameters, my Python scripts also aimed to quantify subtler aspects of speech that contribute to the unique cadence and melody of the Welsh accent. For instance, by analysing the speech rate and the duration of pauses between words, I could glean insights into the rhythmic patterns that are characteristic of the accent.

This initial analysis provided a detailed dataset of the Welsh accent's acoustic properties. The outputs were extensive: CSV files with comprehensive records of the accent's acoustic features, and visualisation plots offering a glimpse into the data's spread and patterns, highlighting both the variability and the commonality across different Welsh speakers.


However, once I had the outputs, it became clear this wasn't going to work either...


  1. The Limitations of Static Data: The data points derived from Python analysis were static. They represented averages and distributions of acoustic features within a controlled dataset. While useful, this information didn't capture the fluidity and adaptability of the Welsh accent in spontaneous speech. Real-life communication involves shifts in tone, pace, and emphasis depending on context, emotion, and interaction, which the data didn't fully encapsulate.
  2. The Need for Dynamic Analysis: To create a voice clone, the system would need to understand the 'how' and 'why' behind the accent's use in different social and linguistic contexts. For example, a question might be asked with a rising intonation, or emphasis might be placed differently to express surprise or sarcasm. These dynamic speech elements are crucial for a voice clone to sound convincingly human.


The realisation that the quantitative approach was insufficient got me thinking...I was still trying to 'invent' a solution to this problem...but there are systems that work with accents all the time...these systems must already have the 'rules' in place that I was trying to create, or they wouldn't be able to recognise the Welsh accent.


I had to adapt, not invent...



Approach - Part 4 - Lightbulb Moment


How did other systems, such as Automatic Speech Recognition (ASR), successfully handle multiple accents, including Welsh?

This question led me to an insight: ASR systems, which are designed to understand and transcribe spoken language, must inherently have a method for interpreting various accents, including the nuances of Welsh speech.

This realisation prompted a shift in my approach from trying to create something new to exploring how existing technologies could be repurposed or adapted to meet our needs. To understand this better, I had to read into how ASR systems work and why they are effective in handling different accents:

  • ASR Systems and Accent Recognition: ASR systems are trained on vast datasets of spoken language, which include a variety of accents. This training enables them to recognise and transcribe speech accurately, regardless of regional pronunciation differences. These systems use sophisticated algorithms and machine learning models to analyse speech patterns. They look for specific characteristics in the sound waves, such as pitch, tone, and rhythm, to identify words and phrases.
  • Understanding Accents: In the context of accents, ASR systems are adept at distinguishing subtle variations in pronunciation. They do not necessarily have a set of hard-coded 'rules' for each accent. Instead, they learn to recognise the accent's characteristics through exposure to diverse speech samples during their training. For instance, a system trained with a significant amount of Welsh-accented speech data would learn to identify the specific features that distinguish the Welsh accent from others. Obviously, the less widely spoken the accent, the more difficult it is to source the data and train with it.

My aim was to understand if the principles used by ASR systems could be applied to voice cloning. If ASR can understand and transcribe the Welsh accent, could a similar methodology be used, sort of in reverse, to clone it?


Solution: Feature Extraction Using a Pre-Trained ASR Model


My search led me to Hugging Face, a hub for advanced machine learning models, where I found a pre-trained Welsh ASR model. This model was already fine-tuned to process and understand Welsh-accented speech, making it an ideal tool for my project. Like Kaggle, Hugging Face is invaluable.

Integrating this model into my workflow, I was able to extract more abstract features from the Welsh voice data. These 'deep features' provided a much more profound understanding of the accent, including intricate phonetic details that my initial analysis could not....

The process in short:

  1. Model Selection and Setup: A pre-trained ASR model specifically tuned for the Welsh language, techiaith/wav2vec2-xlsr-ft-cy, was selected from Hugging Face's model repository. The model was loaded into a Python environment using the transformers library and set up for inference.
  2. Feature Extraction Process: The Welsh audio data from Kaggle was fed into the ASR model, which processed it through its neural network layers. The model's architecture is designed to transform the raw audio signal into a set of features that capture various aspects of the speech. Deep features are extracted from one of the final layers of the neural network before the output layer. These features are high-dimensional vectors that encode the rich information the model has learned about the Welsh accent's phonetic and prosodic characteristics.
  3. Types of Deep Features Extracted: Phonetic Features: the representation of phonemes, the distinct units of sound in speech; the model captures the nuances of how these sounds are produced in the Welsh accent. Prosodic Features: elements like stress, rhythm, and intonation, reflecting the melody and flow of Welsh speech. Acoustic Features: fundamental frequency (pitch), formants, and energy distribution over time. Temporal Features: timing information, such as the duration of phonemes and the pace of speech.
  4. Output: Raw high-dimensional feature vectors extracted from the ASR model. Each row corresponds to a set of features for a given audio sample. After extracting deep features, a dimensionality reduction technique, PCA, was applied to simplify the dataset while retaining the most informative aspects. The resulting principal components were saved in a file.
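The four steps above can be sketched roughly as follows. To keep the example self-contained and runnable offline, a tiny randomly initialised wav2vec2 stands in for the real fine-tuned checkpoint, which would instead be loaded with from_pretrained as noted in the comments:

```python
# Sketch of steps 1-4. In the actual workflow the model line would be:
#   model = Wav2Vec2Model.from_pretrained("techiaith/wav2vec2-xlsr-ft-cy")
import numpy as np
import torch
from transformers import Wav2Vec2Config, Wav2Vec2Model
from sklearn.decomposition import PCA

def extract_deep_features(model, audio):
    """Mean-pool a late hidden layer into one 'deep feature' vector per clip."""
    with torch.no_grad():
        inputs = torch.tensor(audio, dtype=torch.float32).unsqueeze(0)
        out = model(inputs, output_hidden_states=True)
        hidden = out.hidden_states[-2].squeeze(0)  # frames x hidden_size
    return hidden.mean(dim=0).numpy()              # fixed-length vector

# Toy stand-in config (the real checkpoint is far larger)
config = Wav2Vec2Config(hidden_size=64, num_hidden_layers=3,
                        num_attention_heads=4, intermediate_size=128,
                        conv_dim=(64, 64), conv_kernel=(10, 3),
                        conv_stride=(5, 2))
model = Wav2Vec2Model(config).eval()

# One second of 16 kHz audio per "clip" (random noise here; Kaggle WAVs in practice)
clips = [np.random.randn(16000).astype(np.float32) for _ in range(5)]
features = np.stack([extract_deep_features(model, c) for c in clips])

# Step 4: reduce dimensionality while keeping the most informative components
reduced = PCA(n_components=3).fit_transform(features)
```

Which hidden layer to pool, and how, is a design choice; mean-pooling a late layer is a common, simple way to get one fixed-length vector per clip.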


By extracting these deep features with a pre-trained ASR model, I could capture a more complete representation of the Welsh accent, addressing the project's earlier limitations.



Result


When we talk about Automatic Speech Recognition (ASR) models like wav2vec2, the primary purpose is to convert spoken language into written text, i.e., transcriptions. But in doing so, the model has to internally understand a lot about the phonetic and acoustic properties of the speech it's transcribing.


How Does the Model Understand the Accent?

  1. Learning from Data: The wav2vec2 model I used has been fine-tuned on Welsh data. During this fine-tuning process, it's exposed to a multitude of Welsh-accented English speech samples. The model "learns" to recognise the unique phonetic characteristics and nuances of the Welsh accent in order to predict the correct transcription.
  2. Internal Representations: When the model processed my audio clips, it passed the data through several layers of the neural network. At each layer, the raw audio data is transformed into a more abstract representation. Early layers might capture basic sound patterns, while deeper layers capture more complex structures, including phonetic and linguistic features. By the time you get to the final layers, the model has a rich, high-dimensional representation of the audio that encompasses its phonetic content and accent characteristics.


How To Capture the Accent in Features?

I extracted features from one of the final layers of the model. These features are essentially the model's internal representation of the audio clip, capturing all its learned knowledge about the phonetic properties of that clip, including the Welsh accent.


Why Is This Useful?

These deep features act as a kind of "accent signature". For a trained model, Welsh-accented speech will produce different feature values than, say, American-accented speech, even if the transcribed words are the same.

With this "accent signature", it is possible to perform a variety of tasks:

  1. Accent Classification: If you had audio samples from various accents and extracted features for all of them, you could train a classifier to identify the accent based on these features.
  2. Accent Analysis: You could use statistical or machine learning methods to analyse variations within the Welsh accent, like regional differences.
  3. Training Other Models: If you were building a system to, say, evaluate Welsh accent authenticity or strength, these features would be invaluable.
  4. Voice Cloning: If the goal is to clone a specific accent, as mine was, these features can inform the synthesis process, ensuring that the generated speech not only has the correct pronunciation but also carries the distinctive prosodic and rhythmic characteristics of the accent.
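As a toy illustration of the first use (accent classification), a linear classifier trained on synthetic "signature" vectors; real deep features from the ASR model would slot in exactly the same way:

```python
# Toy accent classification over synthetic "accent signature" vectors.
# The two Gaussian clusters are fabricated stand-ins for real deep features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
welsh = rng.normal(loc=0.0, scale=1.0, size=(50, 64))   # fake Welsh signatures
other = rng.normal(loc=1.5, scale=1.0, size=(50, 64))   # fake other-accent ones
X = np.vstack([welsh, other])
labels = np.array([0] * 50 + [1] * 50)                  # 0 = Welsh, 1 = other

clf = LogisticRegression(max_iter=1000).fit(X, labels)
accuracy = clf.score(X, labels)  # near-perfect on this easy synthetic split
```

With well-separated clusters this is trivially accurate; the interesting question in practice is how separable real accents are in the feature space.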


What About Phonetics?

What about where we started - direct phonetic transcription (like the International Phonetic Alphabet)? It provides a standardised way to describe the sounds of spoken language.

We don't have a direct phonetic transcription from our current use case. Instead, we have a high-dimensional feature vector that encapsulates a lot of phonetic information in a non-standard way.

For many tasks, this is even better than a standard phonetic transcription because it's richer and captures more nuances.


Conclusion

In conclusion, while we're not getting a direct phonetic transcription, we are capturing the phonetic characteristics of the Welsh accent in a way that's highly useful for many computational tasks.



Reflection


Although the solution I've conceptualised might not be groundbreaking for experts in the field (as I understand it, this is a somewhat novel approach, but one within an area of active research), it demonstrates the power of AI in enhancing our problem-solving capabilities and capacity for innovation.

I hope what I have documented exemplifies not only the value of iterative exploration and intellectual curiosity but, also, the accelerated pace of research and development achievable through a strategic problem-solving process leveraging AI.

With no prior knowledge of any relevant area, and mostly unable to code, I started out pursuing a dedicated Welsh-accent-to-English-accent dictionary....but my approach quickly evolved. I delved into more innovative avenues, including the possibility of creating a phonetic dictionary programmatically. Although no individual step culminated in a definitive solution, each was instrumental in deepening my understanding of the relationship between linguistics and technology, which in turn allowed me to make the connections necessary to propel us towards a solution.

The real acceleration in this journey stemmed from AI tools, like OpenAI's ChatGPT, which I utilised to rapidly assimilate complex ASR principles, bypassing traditional prolonged learning phases. This immediate grounding in technical knowledge, together with the ability to quickly generate Python code, enabled swift pivots in exploration and testing, condensing what might have been weeks of R&D into a much shorter timeframe.

In just one evening, I was able to absorb the essential principles and intricacies of voice cloning and ASR systems. This rapid learning curve is a testament to how AI can streamline complex research, making it more accessible and less time-consuming.

Each phase, from dictionary-creation attempts to scrutinising ASR systems, was marked by a blend of AI-assisted research and creative, flexible problem-solving. I hope this approach underscores the importance of a dynamic, adaptive mindset in tackling problems and innovation.

AI is massively important and helpful - but the approach is fundamental - methodical yet agile exploration, where the real breakthrough lies not solely in the tools we use but in how we learn and adapt continually, driven by curiosity.

As for my findings, I have handed them on to someone far better placed to implement the learnings, and I wish them all the best in doing so. We may not quite yet have our Welsh-accented clone, but I'm sure it's not too many months away.



You managed to do all this in one evening??? Very funny, trying to persuade AI to be Welsh. What about a Belfast accent? Or Geordie? Apparently the Yorkshire accent is the most trusted. I suppose this means that in due course we'll have bots talking to us in pure Leeds (Leedsish?).

Excellent Warren! Now I wonder if AI can determine the difference between a Newport… Cardiff… Swansea and Valley accent?!
