SSML - The CSS of TTS
This is not a technical article, but speaking geek to geek, I will assume you know the basics of HTML and understand the importance of CSS as a style guide for web design. In practice, it is in this external file that we define things tied simultaneously to the aesthetics, accessibility, and productivity of the entire project. This style file shapes the experience of the user who visits a website and boosts the productivity of those who develop it. Many good things can happen from the right combination of these two files.
We are still in the infancy of voice-driven interfaces and gadgets, so a few tricks that refine or improve your skill's experience can make all the difference when publishing to Alexa or Google Assistant. One way to stand out is by using SSML, or Speech Synthesis Markup Language, to add dimension and depth to conversational projects.
Just as CSS serves as a style guide for HTML in a web project, SSML allows you to edit the synthetic voice of your VUI project so that it sounds more natural or, if you wish, more like a zombie for a voice game. Believe me: we give too little value to things like speed, tone of voice, and dramatic pauses when it comes to conversational skills based on TTS. This was already studied in the early 1970s by Albert Mehrabian.
Mehrabian, prosody and SSML
Albert Mehrabian, in his 1971 book Silent Messages, cited for the first time a study suggesting that the words themselves are responsible for only 7% of a message's impact, leaving the remaining 93% to facial expressions, body movements, and prosody - the linguistic features associated with the intonation, rhythm, and tone of the human voice.
The use of SSML is recommended by the W3C, and its tags are based on a mix of JSML (Java Speech Markup Language) and XML. These tags can change the volume, duration, pitch, and other aspects of synthetic speech directly in the skill's code, and they behave slightly differently on Alexa, Google Assistant, Watson, or Cortana.
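As a minimal sketch of what that looks like in practice - the attribute values below are illustrative, and exact support varies from platform to platform - an SSML response might be written like this:

<speak>
  <!-- Slow down and lower the pitch for a calmer greeting -->
  <prosody rate="slow" pitch="-2st">Welcome back.</prosody>
  <!-- A dramatic pause before the next sentence -->
  <break time="700ms"/>
  <prosody volume="loud">I have news for you.</prosody>
</speak>

The same markup is accepted by both Alexa and Google Assistant, but always test on the target device, since each engine renders it a little differently.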
The human brain, whenever it identifies a new interlocutor, tries to fill in the information gaps so it can "catalog" this new contact in our mental address book. Information that is not available - such as the face behind a voice on the radio - is filled in with whatever our brain perceives as "socially acceptable", a phenomenon known as the "ventriloquist effect." This is why, for example, the appearance of a person we only knew over the phone will always look different than expected.
The ventriloquist effect has been cited in many studies as a source of chauvinist, misogynist, and racist bias in AI. That is precisely why deconstructing "false truths" such as "people prefer female assistants" becomes so fundamental. Will we keep turning the global chatbot ecosystem into a virtual Themyscira?
Joshua Davis and the feelings of IBM Watson
We create personas to bring to life the interactive interfaces we use more and more every day, giving them avatars and personal names in the hope that users will have a more comfortable and personalized experience, engage more, and perhaps consume better. But in a voice interface with limited image support, such as the Echo Spot, how can you convey this extra information?
One of the most relevant experiences ever created around a persona on a voice-only platform was IBM Watson's participation in a special episode of the American quiz show Jeopardy!.
At the 2:11 mark, Joshua Davis - CEO of Praystation - demonstrates how he created a particle system that reacts to the emotions of Watson's answers. It's amazing how many variables each emotion can be represented by, and how those emotions can influence the voice.
Audio, poetry and prosody
Audio tweaks to TTS can be easily implemented in your skill using SSML markers, and they often enrich the user experience with nothing more than a code edit - for example, using a tag like <prosody> together with its rate, pitch, or volume attributes.
Using these markers in combination makes it possible to emphasize specific words or to speed up entire parts of a synthesized sentence.
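For instance - a sketch, with the wording and values chosen purely for illustration - emphasis and rate can be combined in a single response:

<speak>
  You have <emphasis level="strong">three</emphasis> new messages.
  <!-- Speed up the boilerplate the user has already heard many times -->
  <prosody rate="fast">Say stop at any time to end this briefing.</prosody>
</speak>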
The synthesized voice is limited and sounds plain or flat, to use the technical terms. This lack of depth makes it impossible, for example, to create skills built around poetic quotes or voice games.
The <audio> tag will add a new experience layer to your skill.
The two most popular platforms - Amazon and Google - offer their own sound libraries, which can only be used in skills for their own gadgets, but free sound libraries that work on both platforms are quite easy to find on the internet.
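A minimal sketch of the tag in use - the URL is hypothetical, and each platform imposes its own rules on audio format, bitrate, length, and HTTPS hosting:

<speak>
  <!-- Hypothetical sound effect hosted on your own HTTPS server -->
  <audio src="https://example.com/sfx/door-creak.mp3"/>
  Welcome to the haunted mansion.
</speak>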
There are a few hacks that can brighten up a skill designed with TTS:
. Create skills using only pre-recorded voices and integrate those files via <audio>; whenever you need some extra emotion, you won't have to settle for the limitations of the few synthetic voices available on these platforms. This solves, for example, the problem of poetic quotes and of storytelling in voice games.
. HACK! Record the synthetic voice - a feature available in Google Actions - treat the file in your favorite audio software, and then bring the sound file back in via <audio>. The effect is surprising and quite useful for voice games built entirely with TTS.
. When using sound files, it is good practice to give the skill an introductory track so that the user is already familiar with the feature before hearing any incidental effects or background tracks.
. The gender marker available on some platforms can temporarily change the gender of your skill's voice, giving the impression that we are talking to two different people at the same time (see the sketch just after this list). There is no need to explain how useful this is for creating more elaborate games.
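Here is a sketch of that last hack. The voice name is illustrative - check which named voices your platform actually exposes (on Alexa, for example, the <voice> tag accepts Amazon Polly voice names):

<speak>
  <!-- The skill's default voice plays the narrator -->
  You enter the tavern and the innkeeper waves at you.
  <!-- Switch to a different named voice for the character; "Brian" is just an example -->
  <voice name="Brian">Welcome, traveler! What brings you here tonight?</voice>
</speak>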
SSML Simulators
There are great SSML simulators that show in practice some of the topics covered in this article; VocalWare is one of the simplest and is enough to demonstrate the importance of these markers.
We are still living in the infancy of voice interfaces, but digital voices based on TTS and STT will offer far more immersive and interactive experiences within a few years.
If a simple giggle was enough to make Alexa videos go viral all over the world, just imagine what good use of SSML can do for your conversational project.