SSML - The CSS of TTS
This is not a technical article, but speaking geek to geek, I will assume you know the basics of HTML and understand the importance of CSS as a style guide for web design. In practice, it is in this external file that we define things tied simultaneously to the aesthetics, accessibility, and productivity of the entire project. This style file shapes the experience of the user who visits a website and boosts the productivity of those who develop it. Many good things can happen from the right combination of these two files.
We are still in the infancy of voice-driven interfaces and gadgets, so a few tricks that refine or improve your skill's experience can make all the difference when publishing to Alexa or Google Assistant. One way to stand out is by using SSML, or Speech Synthesis Markup Language, to add dimension and depth to conversational projects.
Just as CSS serves as a style guide for HTML in a web project, SSML allows you to edit the synthetic voice of your VUI project so that it sounds more natural or, if you wish, more like a zombie for a voice game. Believe me: we give too little value to things like speed, tone of voice, and dramatic pauses when it comes to conversational skills based on TTS. This was already studied in the early 1970s by Albert Mehrabian.
Mehrabian, prosody and SSML
Albert Mehrabian, in his 1971 book Silent Messages, cited for the first time a study suggesting that the words themselves are responsible for only 7% of a message's impact, leaving the remaining 93% to facial expressions, body movements, and prosody - the linguistic features associated with the intonation, rhythm, and tone of the human voice.
The use of SSML is recommended by the W3C, and its tags are based on a mix of JSML (Java Speech Markup Language) and XML. These tags can change the volume, duration, pitch, and other aspects of synthetic speech directly in the skill's code, and they behave slightly differently on Alexa, Google Assistant, Watson, or Cortana.
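As a minimal sketch of what that looks like in practice - the attribute values below are illustrative, and exact support varies from platform to platform - an SSML response might be written like this:

<speak>
  <!-- Slow down and lower the pitch for a calmer greeting -->
  <prosody rate="slow" pitch="-2st">Welcome back.</prosody>
  <!-- A dramatic pause before the next sentence -->
  <break time="700ms"/>
  <prosody volume="loud">I have news for you.</prosody>
</speak>

The same markup is accepted by both Alexa and Google Assistant, but always test on the target device, since each engine renders it a little differently.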
The human brain, whenever it identifies a new interlocutor, tries to fill in the information gaps so it can "catalog" this new contact in our mental address book. Information that is not available - such as the face behind a voice on the radio - is filled in with whatever our brain perceives as "socially acceptable", a phenomenon known as the "ventriloquist effect." This is why, for example, the appearance of a person we only knew over the phone will always look different than expected.
The ventriloquist effect has been cited in many studies as a source of chauvinist, misogynist, and racist bias in AI. That is precisely why deconstructing "false truths" such as "people prefer female assistants" becomes so fundamental. Will we keep turning the global chatbot ecosystem into a virtual Themyscira?
Joshua Davis and the feelings of IBM Watson
We create personas to bring to life the interactive interfaces we use more and more every day, giving them avatars and personal names in the hope that users will have a more comfortable and personalized experience, engage more, and perhaps consume better. But in a voice interface with limited image support, such as the Echo Spot, how can you convey this extra information?
One of the most relevant experiences ever created around a persona on a voice-only platform was IBM Watson's participation in a special episode of the American quiz show Jeopardy!.
At the 2:11 mark, Joshua Davis - CEO of Praystation - demonstrates how he created a particle system that reacts to the emotions of Watson's answers. It's amazing how many variables each emotion can be represented by, and how those emotions can influence the voice.
Audio, poetry and prosody
Audio tweaks to TTS can be easily implemented in your skill using SSML markers, and they often enrich the user experience with nothing more than a code edit - for example, using a tag like <prosody> together with its rate, pitch, or volume attributes.
Using these markers in combination makes it possible to emphasize specific words or to speed up entire parts of a synthesized sentence.
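For instance - a sketch, with the wording and values chosen purely for illustration - emphasis and rate can be combined in a single response:

<speak>
  You have <emphasis level="strong">three</emphasis> new messages.
  <!-- Speed up the boilerplate the user has already heard many times -->
  <prosody rate="fast">Say stop at any time to end this briefing.</prosody>
</speak>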
The synthesized voice is limited and sounds plain or flat, to use the technical terms. This lack of depth makes it impossible, for example, to create skills built around poetic quotes or voice games.
The <audio> tag will add a new experience layer to your skill.
The two most popular platforms - Amazon and Google - offer their own sound libraries, which can only be used in skills for their own gadgets, but free sound libraries that work on both platforms are quite easy to find on the internet.
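A minimal sketch of the tag in use - the URL is hypothetical, and each platform imposes its own rules on audio format, bitrate, length, and HTTPS hosting:

<speak>
  <!-- Hypothetical sound effect hosted on your own HTTPS server -->
  <audio src="https://example.com/sfx/door-creak.mp3"/>
  Welcome to the haunted mansion.
</speak>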
There are a few hacks that can brighten up a skill designed with TTS:
. Create skills using only pre-recorded voices and integrate those files via <audio>; whenever you need some extra emotion, you won't have to settle for the limitations of the few synthetic voices available on these platforms. This solves, for example, the problem of poetic quotes and of storytelling in voice games.
. HACK! Record the synthetic voice - a feature available in Google Actions - treat the file in your favorite audio software, and then bring the sound file back in via <audio>. The effect is surprising and quite useful for voice games built entirely with TTS.
. When using sound files, it is good practice to give the skill an introductory track so that the user is already familiar with the feature before hearing any incidental effects or background tracks.
. The gender marker available on some platforms can temporarily change the gender of your skill's voice, giving the impression that we are talking to two different people at the same time (see the sketch just after this list). There is no need to explain how useful this is for creating more elaborate games.
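Here is a sketch of that last hack. The voice name is illustrative - check which named voices your platform actually exposes (on Alexa, for example, the <voice> tag accepts Amazon Polly voice names):

<speak>
  <!-- The skill's default voice plays the narrator -->
  You enter the tavern and the innkeeper waves at you.
  <!-- Switch to a different named voice for the character; "Brian" is just an example -->
  <voice name="Brian">Welcome, traveler! What brings you here tonight?</voice>
</speak>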
SSML Simulators
There are great SSML simulators that show in practice some of the topics covered in this article; VocalWare is one of the simplest and is enough to demonstrate the importance of these markers.
We are still living in the infancy of voice interfaces, but digital voices based on TTS and STT will offer far more immersive and interactive experiences within a few years.
If a simple giggle was enough to make Alexa videos go viral all over the world, just imagine what good use of SSML can do for your conversational project.