Accent on Innovation: AI Meets the Welsh
Warren Paull
B2B | Growth | Innovation | Strategy | Marketing | Tech | Specialising in Start-up Growth, Scale-up Operations, and Business Transformation
In an age where marketing is driven by personalised content, I was most interested in investigating how voice cloning technology could be used to scale my clients' sales and marketing efforts. Voice cloning has come on leaps and bounds in recent times and is emerging as a viable tool for creating engaging and relatable media at scale.
I found the tech to be great - lots of platforms out there - it reproduced my own voice very well (and improved my 'Estuary/Multicultural-London-English' accent in the process, making me sound altogether very well-spoken)!
However, a specific need from a client presented an unforeseen challenge that revealed quite a big gap in these new technologies...
Although I was personally quite happy to hear my accent smoothed out to a more 'standard' English, my client is a proud Welsh woman. Her accent is fundamentally important - both to her identity and, from a practical perspective, to maintain authenticity in any marketing we want to do using this technology. The initial tests with available voice cloning systems, while successful in creating realistic voice clones in terms of tone, failed entirely to replicate the Welsh accent authentically. This is more than a technical shortfall; it is a barrier to realising authenticity and cultural resonance.
This was intriguing, and so began another evening lost to AI....
The Challenge
I found myself genuinely curious. What made the Welsh accent so tough to clone? I knew nothing about linguistics or voice cloning technology....but that's never stopped me from wasting my time before!
The issues at play, in summary, were:
Approach - Part 1 - Find a Dictionary:
My immediate thought was to find a linguistic tool that could bridge this gap. A phonetic dictionary specifically for the Welsh accent seemed like the logical solution. Such a resource could provide a system with a clear set of rules to mimic the accent's unique sounds.
However, I quickly discovered that no such dedicated Welsh-accent-to-English-accent dictionary exists.
So, I moved on to deeper exploration...
Approach - Part 2 - Write a Dictionary:
This absence of a specialised phonetic dictionary led me to consider the possibility of creating my own (naivety or hubris?). I thought perhaps I could create one programmatically based on the guiding principles of the accent. My plan was:
It very quickly became clear this was unviable....
Is it possible? Maybe. But it was far too much time and effort to even scope, given I was just dabbling.
I certainly very quickly came to appreciate the intricate relationship between the theoretical aspects of linguistics and the practical challenges of programming and AI technology.
Approach - Part 3 - Voice Analysis:
That idea was scuppered, and my next thought was to see if we could extract measurements from voice analysis which could then be used to form principles or rules for creating the Welsh accent clones. This necessitated a deep dive into voice analysis using ChatGPT's Python abilities.
First, Data:
Kaggle is more than just a database; it's a community-driven platform where data scientists, researchers, and enthusiasts share a plethora of datasets ranging from user-generated content to professionally curated collections. For anyone delving into a data-driven project, Kaggle is useful.
On Kaggle, I found a dataset labelled 'libritts-welsh' which turned out to be a goldmine for my project. This dataset contained hundreds of audio samples featuring a wide range of authentic Welsh accents. This diversity was excellent – having a broad spectrum of accents meant we could potentially recognise the nuances and variations inherent in Welsh speech.
Without Kaggle, sourcing such a specialised dataset would have been a far more daunting task. Instead of spending considerable time and resources collecting recordings, I could immediately focus on preprocessing and analysing the data, moving the project forward more efficiently.
Utilising ChatGPT's Python Abilities for Initial Analysis:
The foundational step was to ensure the integrity of the Welsh voice data sourced from Kaggle. Using the Python libraries noisereduce for noise reduction, and pydub and librosa for volume normalisation and silence trimming, I preprocessed the recordings to remove noise and inconsistencies. This ensured a clean dataset, which meant that any features extracted would be representative of the accent itself, not marred by recording artifacts.
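As an illustration only (not the project's actual script), the two simplest preprocessing operations mentioned above, volume normalisation and silence trimming, can be sketched in plain Python. The function names, the threshold and the toy sample values are assumptions made for this example; libraries such as pydub and librosa handle real audio far more robustly.

```python
def peak_normalise(samples, target_peak=0.9):
    """Scale samples so the loudest one has magnitude target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def trim_silence(samples, threshold=0.02):
    """Drop leading and trailing samples whose magnitude is below threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []  # the whole clip is below the threshold
    return list(samples[loud[0]:loud[-1] + 1])


# Toy clip (hypothetical values): silence, a short burst, silence
clip = [0.0, 0.001, 0.2, -0.45, 0.3, 0.0, 0.0]
cleaned = peak_normalise(trim_silence(clip))
```

Trimming before normalising means the gain is computed from the speech itself rather than from any stray click in the silent margins.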
Next, I extracted a range of acoustic features from the Welsh voice samples. I applied librosa and praat-py to extract fundamental features such as pitch, intensity, duration, and timbre, as well as to quantify more complex speech attributes like formant frequencies and temporal dynamics.
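To give a feel for what two of these measurements involve, here is a minimal sketch, assuming a single voiced frame held as a list of floats. It estimates pitch from the autocorrelation peak and intensity as root-mean-square amplitude. This is a simplified stand-in for what dedicated libraries do, not the method used in the project; the function names, frequency bounds and synthetic tone are all assumptions for illustration.

```python
import math


def rms_intensity(samples):
    """Root-mean-square amplitude, a simple proxy for loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_pitch(samples, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate the fundamental frequency (Hz) from the autocorrelation peak."""
    min_lag = int(sample_rate / fmax)  # shortest period considered
    max_lag = int(sample_rate / fmin)  # longest period considered
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself shifted by `lag` samples
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Synthetic 200 Hz tone standing in for a voiced frame (hypothetical data)
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(400)]
pitch_hz = estimate_pitch(tone, sample_rate)
loudness = rms_intensity(tone)
```

Real speech is much messier than a pure tone, which is why production tools add voicing detection and smoothing on top of this basic idea.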
Beyond these basic parameters, my Python scripts also aimed to quantify subtler aspects of speech that contribute to the unique cadence and melody of the Welsh accent. For instance, by analysing the speech rate and the duration of pauses between words, I could glean insights into the rhythmic patterns that are characteristic of the accent.
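A rough sketch of how pause durations and speech rate might be measured is shown below. It assumes the audio has already been cut into fixed-length frames with one energy value each, and that a word count is available from a transcript; both assumptions, along with the function names and threshold, are illustrative rather than taken from the original scripts.

```python
def pause_durations(frame_energies, frame_seconds, silence_threshold=0.01):
    """Return the length in seconds of each run of consecutive quiet frames."""
    pauses, run = [], 0
    for energy in frame_energies:
        if energy < silence_threshold:
            run += 1  # still inside a pause
        elif run:
            pauses.append(run * frame_seconds)  # pause just ended
            run = 0
    if run:
        pauses.append(run * frame_seconds)  # clip ended during a pause
    return pauses


def speech_rate(word_count, total_seconds):
    """Words per second over the whole clip."""
    return word_count / total_seconds


# Hypothetical per-frame energies, 20 ms per frame
energies = [0.2, 0.3, 0.0, 0.0, 0.25, 0.0, 0.4, 0.0, 0.0, 0.0]
pauses = pause_durations(energies, frame_seconds=0.02)
rate = speech_rate(word_count=12, total_seconds=4.0)
```

Comparing the distribution of pause lengths across speakers is one simple way to put numbers on rhythm.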
This initial analysis provided a detailed dataset of the Welsh accent's acoustic properties. The outputs were extensive: CSV files with comprehensive records of the accent's acoustic features, and visualisation plots that offered a glimpse into the data's spread and patterns, highlighting both the variability and the consistency across different Welsh speakers.
However, once I had the outputs, it became clear this wasn't going to work either...
The realisation that the quantitative approach was insufficient got me thinking...I was still trying to 'invent' a solution to this problem...but there are systems that work with accents all the time...these systems must already have the 'rules' in place that I was trying to create, or they wouldn't be able to recognise the Welsh accent.
I had to adapt, not invent...
Approach - Part 4 - Lightbulb Moment
How did other systems, such as Automatic Speech Recognition (ASR), successfully handle multiple accents, including Welsh?
This question led me to an insight: ASR systems, which are designed to understand and transcribe spoken language, must inherently have a method for interpreting various accents, including the nuances of Welsh speech.
This realisation prompted a shift in my approach from trying to create something new to exploring how existing technologies could be repurposed or adapted to meet our needs. To understand this better, I had to read into how ASR systems work and why they are effective in handling different accents:
My aim was to understand if the principles used by ASR systems could be applied to voice cloning. If ASR can understand and transcribe the Welsh accent, could a similar methodology be used, sort of in reverse, to clone it?
Solution: Feature Extraction Using a Pre-Trained ASR Model
My search led me to Hugging Face, a hub for advanced machine learning models, where I found a pre-trained Welsh ASR model. This model was already fine-tuned to process and understand Welsh-accented speech, making it an ideal tool for my project. Like Kaggle, Hugging Face is invaluable.
Integrating this model into my workflow, I was able to extract more abstract features from the Welsh voice data. These 'deep features' provided a much more profound understanding of the accent, including intricate phonetic details that my initial analysis could not capture.
The process in short:
By extracting these deep features with a pre-trained ASR model, I could capture a more complete representation of the Welsh accent, addressing the project's earlier limitations.
Result
When we talk about Automatic Speech Recognition (ASR) models like wav2vec2, the primary purpose is to convert spoken language into written text, i.e., transcriptions. But in doing so, the model has to internally understand a lot about the phonetic and acoustic properties of the speech it's transcribing.
How Does the Model Understand the Accent?
How To Capture the Accent in Features?
The accent is captured by extracting features from one of the final layers of the model. These features are essentially the model's internal representation of the audio clip, capturing all its learned knowledge about the phonetic properties of that clip, including the Welsh accent.
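As a hedged sketch of what happens once those layer outputs are available: a model of this kind produces one vector per short time frame, so a common way to get a single clip-level representation is to average the frames of the chosen layer. The code below assumes the hidden states have already been obtained from the model and converted to nested Python lists; the function names and the tiny toy values are hypothetical (a real model would return hundreds of dimensions per frame).

```python
def mean_pool(frames):
    """Average frame-level vectors over time into one fixed-length vector."""
    if not frames:
        raise ValueError("need at least one frame")
    dims = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames)
            for d in range(dims)]


def accent_embedding(hidden_states, layer=-1):
    """Pool one layer of a model's hidden states into a clip-level embedding."""
    return mean_pool(hidden_states[layer])


# Hypothetical toy output: 2 layers, 3 time frames, 4 dimensions each
hidden_states = [
    [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
    [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0], [2.0, 2.0, 2.0, 2.0]],
]
embedding = accent_embedding(hidden_states)  # pools the last layer
```

Mean pooling is only one option; which layer and which pooling method best preserve accent information is something that would need to be tested empirically.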
Why Is This Useful?
These deep features act as a kind of "accent signature". For a trained model, Welsh-accented speech will produce different feature values than, say, American-accented speech, even if the transcribed words are the same.
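One simple way to compare two such signatures is cosine similarity, sketched below. The three short vectors are invented purely for illustration and are not measurements from any model; the variable names are likewise hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


# Invented toy signatures (not real model output)
welsh_reference = [0.9, 0.1, 0.4]
same_accent_clip = [0.8, 0.2, 0.5]
other_accent_clip = [0.1, 0.9, 0.2]

sim_same = cosine_similarity(welsh_reference, same_accent_clip)
sim_other = cosine_similarity(welsh_reference, other_accent_clip)
```

In this toy setup the clip with the matching accent scores closer to 1.0 than the other clip; whether real embeddings separate accents this cleanly would depend on the model and the data.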
With this "accent signature", it is possible to perform a variety of tasks:
What About Phonetics?
What about where we started - direct phonetic transcription (like the International Phonetic Alphabet)? It provides a standardised way to describe the sounds of spoken language.
Our current approach does not give us a direct phonetic transcription. Instead, we have a high-dimensional feature vector that encapsulates a lot of phonetic information in a non-standard way.
For many tasks, this is even better than a standard phonetic transcription because it's richer and captures more nuances.
Conclusion
In conclusion, while we're not getting a direct phonetic transcription, we are capturing the phonetic characteristics of the Welsh accent in a way that's highly useful for many computational tasks.
Reflection
Although the solution I've conceptualised might not be groundbreaking for experts in the field (as I have since come to understand, it is a somewhat novel approach in an area of active research), it demonstrates the power of AI in enhancing our problem-solving capabilities and capacity for innovation.
I hope what I have documented exemplifies not only the value of iterative exploration and intellectual curiosity but also the accelerated pace of research and development achievable through a strategic problem-solving process leveraging AI.
With no prior knowledge of any relevant area, and mostly unable to code, I started out pursuing a dedicated Welsh-accent-to-English-accent dictionary....but my approach quickly evolved. I delved into more innovative avenues, including the possibility of creating a phonetic dictionary programmatically. Although none of these steps culminated in a definitive solution, each was instrumental in deepening my understanding of the relationship between linguistics and technology, which in turn allowed me to make the connections necessary to propel us towards a solution.
The real acceleration in this journey stemmed from AI tools, like OpenAI's ChatGPT, which I utilised to rapidly assimilate complex ASR principles, bypassing traditional prolonged learning phases. This immediate grounding in technical knowledge, together with the ability to quickly generate Python code, enabled swift pivots in exploration and testing, condensing what might have been weeks of R&D into a much shorter timeframe.
In just one evening, I was able to absorb the essential principles and intricacies of voice cloning and ASR systems. This rapid learning curve is a testament to how AI can streamline complex research, making it more accessible and less time-consuming.
Each phase, from dictionary-creation attempts to scrutinising ASR systems, was marked by a blend of AI-assisted research and creative, flexible problem-solving. I hope this approach underscores the importance of a dynamic, adaptive mindset in tackling problems and innovation.
AI is massively important and helpful - but the approach is fundamental - methodical yet agile exploration, where the real breakthrough lies not solely in the tools we use but in how we learn and adapt continually, driven by curiosity.
As for my findings, I have handed them on to someone far better placed to implement the learnings, and I wish them all the best in doing so. We may not quite yet have our Welsh-accented clone, but I'm sure it's not too many months away.
You managed to do all this in one evening??? Very funny, trying to persuade AI to be Welsh. What about a Belfast accent? Or Geordie? Apparently the Yorkshire accent is the most trusted. I suppose this means that in due course we'll have bots talking to us in pure Leeds (Leedsish?).
Excellent Warren! Now I wonder if AI can determine the difference between a Newport, Cardiff, Swansea and Valley accent?!