Deep Learning for Machine Empathy: Robot-Human Interaction (Part I)
Nelson Fernandez-Pinto
GenAI at Air Liquide | I write about LLMs and Diffusion models
When we think about the next digital revolution, it is clear that humanity will face an unprecedented wave of automation. More and more smart, connected devices will coexist with us. This revolution is already taking place, from cell phones to autonomous vehicles and even our refrigerators. One thing is certain: robots are already here, and they are here to stay.
The question is not whether we agree, but how we will interact with these new tenants. Beyond the classic principles of design, such as utility and style, a new criterion will gain relevance: machine empathy. This tendency will grow stronger as more companies understand that human-machine interaction (HMI) is key to securing technology adoption.
But what can we do to improve human-machine interactions? Can we at least soften our coexistence?
The key to social integration is mastering the ability to understand what other people feel and think, and to react accordingly. Until now, this capacity has been reserved for (some) humans. This virtue, called empathy, improves socialisation, and humans are sociable by nature.
So the answer could be to give machines the ability to understand how we feel, what we need and what our goals are, so that they can react accordingly and maximise our comfort. This also includes giving them the right form. Will this new generation of robots be humanoids? Gentle automata like a harmless Roomba? Or perhaps as terrifying as the robot 'dogs' in Black Mirror's Metalhead episode and their real-life distant relatives from Boston Dynamics? Form is part of the whole when discussing HMI.
Many researchers have worked in this field, most notably the Humanoid Robotics Group at MIT, which developed Kismet, a social robot. Kismet reacts kindly to the emotions shown by its viewers, engaging people in natural, expressive face-to-face interactions. Preliminary results show a great improvement in the interaction between humans and these machines (Breazeal, 2000).
It is clear that the success of this new wave of automation will depend to a large extent on the empathy and personality of the robots. Imagine a car that detects that you are feeling sad and automatically plays a song you love to cheer you up, or a robotic medical assistant that recognises your needs and reacts to give you maximum attention and comfort. By adding powerful automatic speech recognition and natural language processing (extensively developed by Amazon Alexa and others), the possibilities are endless.
Such a system could be fed by external sources of information, allowing it to evolve with experience. Your device would continuously learn from you. This hyper-personalisation will have a direct consequence: uniqueness. Uniqueness is the fuel of attachment, and attachment is intrinsically human.
In the science-fiction movie Real Steel (2011), Atom, the boxing robot, suffers serious damage several times during combat. Suddenly, emotions begin to surface: we don't want to lose Atom; it is unique. We know what made Atom so special compared to other robots: it showed feelings; it was empathetic.
But don’t worry: by then, cloud storage and telecommunications technology will be so developed that there is little chance of losing your robot’s personality.
It is not clear how this could change the technology industry and affect consumer habits. Would you change your car as frequently as you do now? Would you feel that your device is unique? Would you bond with it?
The reality is that we still do not have answers to these questions. This revolution is just beginning, and its potential consequences are not yet fully understood. This topic will therefore be part of an open discussion in the coming years.
Deep Learning and Emotion Recognition
Emotion recognition is the first step on the journey towards truly “empathetic” machines. Such systems have been successfully built using deep learning architectures, specifically convolutional neural networks (CNNs).
The secret behind this success is the ability of CNNs to automatically learn relevant low- and high-level characteristics of the input images. The network builds an increasingly explicit representation of the image, learning to combine low- and high-level features until it captures the actual content rather than individual pixel values. This final representation is used to classify the emotion into several categories, such as sadness, joy, anger, fear, surprise and neutrality. A very detailed discussion of how CNNs build such representations can be found in the well-known paper ‘A Neural Algorithm of Artistic Style’ (Gatys, Ecker & Bethge, 2015).
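To make the idea concrete, here is a minimal sketch of such an emotion classifier in Keras. It is illustrative only: the 48×48 grayscale input (common in public facial-expression datasets such as FER-2013), the layer sizes and the six categories listed above are assumptions, not the exact architecture used in this project.

```python
# A minimal emotion-classification CNN: stacked convolutions learn
# increasingly abstract facial features, and a softmax head maps the
# final representation to emotion categories. Sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, models

# Categories from the article; the order must match the training labels.
EMOTIONS = ["sadness", "joy", "anger", "fear", "surprise", "neutral"]

model = models.Sequential([
    layers.Input(shape=(48, 48, 1)),          # 48x48 grayscale face crop
    # Early convolutions capture low-level features (edges, corners).
    layers.Conv2D(32, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    # Deeper convolutions combine them into higher-level facial patterns.
    layers.Conv2D(64, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    # The final representation is classified into the emotion categories.
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(len(EMOTIONS), activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Trained on labelled face crops, a network of roughly this shape is enough to reproduce the kind of per-frame emotion predictions discussed below.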
The following video is the result of a one-week immersion in real-time emotion recognition using deep convolutional neural networks. To test the solution, we chose the famous ‘Sad Ben Affleck’ interview. The preliminary results are shown in the video (more improvements are coming):
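For readers curious how such a demo is wired together, here is a hedged sketch of a typical real-time pipeline: a classic Haar-cascade face detector crops faces from each frame and feeds them to the CNN above. The weights file name and the webcam source are hypothetical placeholders, not the project’s actual artifacts.

```python
# Real-time emotion recognition sketch: Haar-cascade face detection
# (classical, fast) + CNN inference per detected face, per frame.
import cv2
import numpy as np
import tensorflow as tf

EMOTIONS = ["sadness", "joy", "anger", "fear", "surprise", "neutral"]

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
model = tf.keras.models.load_model("emotion_cnn.h5")  # hypothetical weights

cap = cv2.VideoCapture(0)  # webcam; replace with a video file path if needed
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in face_detector.detectMultiScale(gray, 1.3, 5):
        # Crop and resize the face to the network's expected input.
        face = cv2.resize(gray[y:y + h, x:x + w], (48, 48))
        face = face.astype("float32")[None, :, :, None] / 255.0
        probs = model.predict(face, verbose=0)[0]
        label = EMOTIONS[int(np.argmax(probs))]
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, label, (x, y - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
    cv2.imshow("emotion", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```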
In the next post we will go straight to the implementation of this basic (and functional) empathy module for robots, based on deep learning. We will delve into computer vision techniques, ranging from classical fast face-detection algorithms to deep neural networks for emotion recognition and transfer learning.
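As a preview of the transfer-learning part, the sketch below shows one common recipe: freeze a backbone pretrained on ImageNet and train only a small classification head on face crops. The choice of VGG16 and the input size are assumptions for illustration; the next post may use a different backbone.

```python
# Transfer-learning sketch: reuse pretrained convolutional features and
# train only a new emotion-classification head.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

NUM_CLASSES = 6  # sadness, joy, anger, fear, surprise, neutral

backbone = VGG16(weights="imagenet", include_top=False,
                 input_shape=(96, 96, 3))
backbone.trainable = False  # keep the pretrained features fixed at first

model = models.Sequential([
    backbone,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Because only the head is trained, this approach needs far less labelled emotion data than training a CNN from scratch.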
Questions
I hope you enjoyed this publication as much as I enjoyed writing it. I will be happy to read your opinions on this topic in the comments. If you want to know more about Axionable, our projects and careers, please visit our website and follow us on Twitter.
Bibliography
Breazeal, C. (2000), “Sociable Machines: Expressive Social Exchange Between Humans and Robots”. Sc.D. dissertation, Department of Electrical Engineering and Computer Science, MIT.
Gatys, L.A., Ecker, A.S., & Bethge, M. (2015). A Neural Algorithm of Artistic Style. CoRR, abs/1508.06576.