Learning to speak machine
Persistently connected and pervasive devices form an increasing, and arguably ever more essential, part of our future. They can manage our homes, monitor our energy consumption and control our environment; they help us to quantify our health, our pulse rate, our activity levels, our calorie intake; they keep us connected to the electronic messaging systems we use to stay informed about news, social comment, the weather, the train timetable, our working lives, our communication with our friends and family, our tribe, our social selves.
This hyper-connected world, with up to 120 billion devices expected to be operating within the Internet of Things in the next 10 years, has been born from the alignment of multiple fields. Technological advancements in voice recognition, immersive audio technology, artificial intelligence and haptics are resulting in novel and increasingly intuitive forms of device interaction.
Early methods of speech recognition aimed to find the closest matching sound label from a discrete set of labels; deep neural networks then delivered a significant reduction in error rates over this approach. The maximum vocabulary size for large-vocabulary speech recognition has increased substantially since 1976. In fact, for real-time natural-language dictation systems the vocabulary became essentially unlimited from the late 1990s: the user was not aware of which relatively rare words were in the system's dictionary and which were not, because the systems tried to recognize every word dictated and counted as an error any word that was not recognized, even if that word was not in the dictionary. Big data has enabled this progress by providing the huge volumes of information needed to refine and improve both the acoustic model and the language model. Research firm Markets and Markets predicted in 2016 that the speech recognition market would reach $10 billion by 2022.
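To make that recipe concrete, here is a deliberately simplified sketch of how an acoustic model and a language model combine in decoding: each candidate word is scored on how closely it matches the audio, weighted by how likely the word is to be spoken, and the best-scoring word wins. The toy vocabulary, feature vectors and probabilities are illustrative assumptions, not the workings of any production recogniser.

```python
# A deliberately simplified sketch of the "closest label" idea combined with a
# language-model prior, as in classic noisy-channel speech decoding.
# The toy vocabulary, feature vectors and probabilities are illustrative
# assumptions, not data from any real recogniser.
import math

# Toy acoustic model: each word is represented by an idealised feature vector.
ACOUSTIC_TEMPLATES = {
    "play":  [0.9, 0.1, 0.2],
    "pause": [0.8, 0.3, 0.1],
    "stop":  [0.1, 0.9, 0.7],
}

# Toy language model: prior probability of each word being spoken.
LANGUAGE_MODEL = {"play": 0.5, "pause": 0.3, "stop": 0.2}

def acoustic_score(observed, template):
    """Higher is better: negative squared distance to the word's template."""
    return -sum((o - t) ** 2 for o, t in zip(observed, template))

def decode(observed):
    """Pick the word maximising acoustic score + log language-model prior."""
    return max(
        ACOUSTIC_TEMPLATES,
        key=lambda w: acoustic_score(observed, ACOUSTIC_TEMPLATES[w])
        + math.log(LANGUAGE_MODEL[w]),
    )

# -> "play": the language-model prior breaks the acoustic near-tie with "pause"
print(decode([0.85, 0.2, 0.15]))
```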
Immersive sound takes the traditional multi-channel surround experience to a whole new level by using height or presence channels (speakers mounted high on walls or on the ceiling) to create a dome of sound. The resulting “3D” audio effect is striking, and brings the listener much closer to artificial sound that mirrors real-world auditory experience. While traditional channel-based audio can be used to create immersive sound, a newer type of audio encoding called “object-based audio” is proving highly effective. Object-based audio allows Hollywood moviemakers to program an audio mix that pinpoints sounds to specific areas of space within a room. Properly equipped AV receivers (AVRs) can decode a film’s object-based soundtrack and use the available speakers to best replicate sound placement as the filmmaker intended. When a gun fires or a character laughs, it feels as though they are right there in the room with you. In business, immersive sound can combat the problems of open-plan offices, making it much easier to communicate and collaborate with those who aren’t in the office.
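The shift from channel-based to object-based audio can be illustrated with a small sketch: rather than baking a sound into fixed speaker channels, the mix stores each sound as an object with a position in space, and the renderer decides at playback time how to share it across whatever speakers happen to be available. The speaker layout, panning rule and class names below are simplifying assumptions, not the encoding used by any commercial format such as Dolby Atmos.

```python
# A minimal illustration of the object-based audio idea: each sound is stored
# as an object with a 3D position, and the renderer decides at playback time
# how to spread it across the speakers that are actually present.
from dataclasses import dataclass
import math

@dataclass
class AudioObject:
    name: str
    position: tuple  # (x, y, z) in room coordinates, z = height

# Example layout: four ear-level speakers plus two height speakers.
SPEAKERS = {
    "front_left":  (-1.0,  1.0, 0.0),
    "front_right": ( 1.0,  1.0, 0.0),
    "rear_left":   (-1.0, -1.0, 0.0),
    "rear_right":  ( 1.0, -1.0, 0.0),
    "top_front":   ( 0.0,  1.0, 1.0),
    "top_rear":    ( 0.0, -1.0, 1.0),
}

def render_gains(obj: AudioObject) -> dict:
    """Distance-based panning: closer speakers get more of the object's signal."""
    weights = {
        name: 1.0 / (0.1 + math.dist(obj.position, pos))
        for name, pos in SPEAKERS.items()
    }
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

overhead_rain = AudioObject("rain", (0.0, 0.0, 1.0))
print(render_gains(overhead_rain))  # the height speakers get the largest per-speaker gains
```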
In her article for Medium, “The Future Interface: Design to Make Technology Human”, Danielle Reid describes how this coalescence of interface technologies changes the way we can interact with the pervasive technologies around us.
“When we can design across systems, devices will evolve or disappear. We will not stop consuming content, rather consume it in a way that suits human behavior, not the other way around.
Future interfaces deliver content in more meaningful, relevant ways. Content can stimulate all of our senses and be delivered at contextually relevant times.
Audio is expanding not only the internet, but also our relationship with the content itself.”
The value of services delivered via pervasive and connected devices will be immense, and has the potential to directly affect both developed and developing countries. In their 2015 report “The Internet of Things: Mapping the Value Beyond the Hype”, consultancy firm McKinsey forecast the global economic value of IoT services to be between $4 trillion and $11 trillion by 2025. The sectors most likely to be directly impacted by pervasive personal devices – Human, Health, Vehicles and Offices – will together deliver between $550 billion and $3 trillion in value over the same period.
Overall, McKinsey’s analysis suggests that over 60% of the value derived from IoT services will be delivered to the developed world. Despite higher numbers of potential deployments in developing countries, the economic benefit – whether in consumer surplus, customer value gained from efficiency savings or technology spend – will disproportionately be delivered to developed economies. The disparity is even more stark in the categories where pervasive and connected devices are likely to be the medium of delivery or the interface to IoT services. In the Human category – which covers monitoring and managing illness and wellness – almost 90% of the value derived will be in developed nations, partly a reflection of the fact that health-care spending in developed economies is twice that of developing economies.
Pervasive and responsive devices have the potential to transform our understanding of, and interactions with, the world:
Case study: doppel
doppel works by creating a silent vibration on the inside of your wrist which feels just like the 'lub-dub' of a heartbeat. By affecting your brain's perception of your heart rate, doppel changes how you feel. In trials, doppel's slow setting reduced stress, and its faster setting increased focus. It works using a recognised scientific phenomenon – synchronisation of the sympathetic and para-sympathetic branches of the autonomic nervous system.
The heartbeats of a mother and baby will synchronize with one another when they interact closely, and similar effects have been observed in couples. Research also shows that when an empathetic partner holds the hand of a woman in pain, their heart and respiratory rates sync and her pain reduces. And you don’t even have to be touching. Researchers also found that if you sit a couple face-to-face and ask them not to talk, just staring at each other for fifteen minutes is enough to get their levels of skin conductance and heart rate to sync up.
Our bodies also respond to non-biological rhythms. For example, the tempo of a song can naturally alter our breathing rate and heart rate – and in fact, researchers in Sweden found that not only can choir singers harmonize their voices, they can also synchronize their heartbeats.
But unlike music, or holding someone’s hand, doppel’s silent beat is non-distracting, so you can feel calm and focused, anytime and anywhere. Researchers from the Psychology Department at Royal Holloway, University of London, assessed the calming effects of doppel and found that its heartbeat-like vibration, delivered to the inside of the wrist, can make the wearer feel significantly less stressed.
The research was published on 24 May 2017 in Nature’s peer-reviewed journal Scientific Reports.
In a controlled, single-blind study, two groups of participants were asked to prepare a public speech – a widely used psychological task that consistently increases stress. All participants wore the device on their wrist and a cover story was used to suggest to participants that the device was measuring blood pressure during the anticipation of the task. Importantly, for only one of the two groups of participants, the device was turned on and delivered a heartbeat-like vibration at a slower frequency than the participants’ resting heart rate, while they were preparing their speech.
The researchers measured both physiological arousal and subjective reports of anxiety. The use of doppel had a tangible and measurable calming effect at both the physiological and the psychological level: only the participants who felt the heartbeat-like vibration displayed lower increases in skin conductance responses and lower anxiety levels.
And the development of more human-centric interfaces to computing power is already showing its potential to help isolated and marginalised groups of users in healthcare settings. In research published in the British Journal of Psychiatry in 2013, a group of medical professionals and research scientists detailed their work enabling patients with schizophrenia who experience persecutory auditory hallucinations to use voice interaction with digital avatars to engage in dialogue with, and ultimately manage or entirely dissipate, their persecutory voices.
One in four patients with schizophrenia currently fails to respond to antipsychotics and remains troubled by persistent auditory hallucinations, which have a major impact on their lives and can lead to suicide.
When asked about the worst aspect of hearing voices, sufferers’ invariable response is a sense of helplessness – but patients who can initiate a dialogue with their voice feel much more in control. In the research, patients were able to choose and adapt the physical, facial manifestation of their voice, and to control its pitch and tone. Responses were provided by a therapist in an adjacent room, whose aim was to steer the dialogue so as to give the patient the opportunity to stand up to the avatar, allowing its responses to come progressively under the patient’s control.
All of the patients in the therapy group (versus the control group) benefitted from significant reductions in the frequency and intensity of the voices, and the voices were also perceived to be less malevolent and omnipotent. At three-month follow-up there was a significant reduction in depressive symptoms. Three of the 16 patients who received therapy, who had been experiencing auditory hallucinations for 16, 13 and 3.5 years respectively, reported that their voices ceased entirely and had not returned at the three-month stage.
But does the attractiveness and seductiveness of our apparent ability to engage with these pervasive devices have a more challenging side? What are the implications of pervasive listening devices for our privacy? What ethical questions should we be putting to the providers of voice-recognition and AI platforms? If a home assistant “hears” a violent struggle – should the police be called? What if it “hears” a fall or the sounds of someone in pain – should it alert medical providers? Do we “speak” differently to listening devices – and if we do, does that new mode of speaking transfer into other communication spheres? Does learning to speak machine make us also speak differently to each other?
And are we confident that, even if we develop the ethical, procedural and privacy controls that mean we can feel comfortable with pervasive listening devices engaging in communication with us, the human side of the equation will prove as rule-abiding as the machines – or that we will even continue to be able to understand the exchanges we are making, or that the machines are making with each other? Microsoft, Google and Facebook have all been experimenting with AI-generated communication and voice interaction, with results ranging from the obscene, hate-crime-referencing abuse taught to Microsoft’s AI avatar Tay by exposure to the Twitter-sphere, to the more benign but ultimately endless-loop ramblings of Vladimir and Estragon, the Google Home chatbots (not the Beckett characters), and the development by Facebook’s negotiation AIs of their own language, evolved from written English but ultimately indecipherable to their programmers.
A voice user interface (VUI) to an AI-enabled processing platform represents an unprecedented degree of abstraction of programming commands. It doesn’t even feel like it’s a computer you are engaging with. That’s part of what makes VUI-AI so extraordinarily accessible to so many, even those who might otherwise have struggled to engage with technology. It is a sure-fire way to bridge the digital divide, but it could leave us with a massively inexperienced consumer base using highly advanced technology and an ever-reducing requirement to actually learn machine-level programming languages at all. Voice-based interaction is utterly fundamental to our species – as Lera Boroditsky’s work on language at Stanford demonstrates. In the conclusion of her summary for Edge.org, Boroditsky explains:
“I have described how languages shape the way we think about space, time, colors, and objects. Other studies have found effects of language on how people construe events, reason about causality, keep track of number, understand material substance, perceive and experience emotion, reason about other people's minds, choose to take risks, and even in the way they choose professions and spouses. Taken together, these results show that linguistic processes are pervasive in most fundamental domains of thought, unconsciously shaping us from the nuts and bolts of cognition and perception to our loftiest abstract notions and major life decisions. Language is central to our experience of being human, and the languages we speak profoundly shape the way we think, the way we see the world, the way we live our lives.”
Language is the transactional currency of social animals. Vocalisation and pack-animal behavioural codes – and, for humans and some other complex mammals such as elephants, language – allow for the development of social bonding and interconnectedness that can span significant physical distances. Unlike peer-grooming, which requires immediate physical proximity, language allows for social group coherence at a distance: tens of kilometres for elephant herds, and right across the worlds of space and time for humans.
Our ability to use language, meaning and nuance in the negotiation of infinitely subtle social contexts is one of the reasons that “speaking machine” appears so superficially attractive but quickly proves so dissatisfying. It is why so many smart-speaker devices, despite their significant processing capability and an ever-burgeoning set of “skill” extensions, end up being used as expensive and comparatively mediocre music players. Computers don’t understand language. When we learn to “speak machine” we adopt verbal cues from a fixed, pre-ordained range of command patterns. We have become our own verbalised command line.
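As an illustration of what that “verbalised command line” looks like from the machine’s side, the sketch below matches utterances against a fixed set of expected phrasings and simply fails on anything off-script. The patterns and intent names are invented for the example, and are far cruder than any real assistant’s skill framework, but the constraint they place on the speaker is the same in kind.

```python
# A toy illustration of "speaking machine": a rigid, pre-ordained command
# grammar of the kind that sits behind many voice "skills". The phrasings and
# intents are invented for illustration; only utterances that fit an expected
# pattern do anything useful.
import re

COMMAND_PATTERNS = {
    r"^(play|put on) (?P<track>.+)$": "play_music",
    r"^set (a )?timer for (?P<minutes>\d+) minutes?$": "set_timer",
    r"^turn (?P<state>on|off) the (?P<device>.+)$": "switch_device",
}

def interpret(utterance: str):
    """Return (intent, slots) if the utterance fits a known pattern, else None."""
    text = utterance.lower().strip()
    for pattern, intent in COMMAND_PATTERNS.items():
        match = re.match(pattern, text)
        if match:
            return intent, match.groupdict()
    return None  # anything off-script is simply not understood

print(interpret("set timer for 10 minutes"))             # ('set_timer', {'minutes': '10'})
print(interpret("could you give us ten more minutes?"))  # None
```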
Computers are extraordinarily capable of processing massive volumes of data and performing almost unimaginably complex calculations within specific parameters. Humans are extraordinarily capable of the adaptive processing and synthesis of entirely novel sources of information in alien contexts.
We will all get better at “speaking machine”, and they will get much, much better at listening to us. But we shouldn’t let the beguiling responsiveness of technology limit our passion and enthusiasm for speaking to each other – or allow the apparent shrinking of communication distances to hold back our natural instinct to search out and experience the diversity and variety of our world.
Caroline Gorski, April 2020