Bring on the Bots: Deep Learning & Conversing
Google created quite the stir recently with the newest party trick its Google Assistant platform can now perform: calling a hairdresser and booking a haircut for you! Admittedly the Duplex AI voice system is still a prototype, and knowing this I must wonder how flexible the technology is at this stage, but nonetheless it is mightily impressive.
There's even cause to surmise this machine could pass a Turing test soon, wherein a person converses blind with both a machine and a real person simultaneously and cannot tell which one is which. The person who picked up the phone on the other end certainly didn't seem to realise!
Make no mistake, the machines are coming, and they want your hair follicles.
The amazing part of this, though, is the nuanced approach to conversation this prototype has. I've used Siri, Alexa and Google Assistant since each launched; I have several Echo devices, an Android phone, an iPad, and a Google Home in the post. I also trigger commands from my computer, running Windows 10, using the "Hey Cortana" command all the time.
I'm very much at ease with voice-triggered and chat/message-based interfaces, and I have noticed a gargantuan leap forward in these technologies since the early days of voice-operated commands built into car computers.
It's still far from perfect; I asked Alexa what the opening hours were for Brent Cross Shopping Centre yesterday, only to get the response "Playing songs by Ben Foster on Spotify" closely followed by the theme from Thunderbirds. Even accounting for my Leeds-born Yorkshire drawl, the initial flash of frustration was palpable as I repeatedly shouted "Alexa, stop!" over a countdown from 5 to 1 interspersed with dramatic orchestral stabs.
Muddled intents aside, most modern interfaces still have stilted speech and jarring linguistic patterns that leave these platforms right at the bottom of the uncanny valley - one the Duplex system hopped across with ease and grace. However, it can only do so because it has learned how humans communicate by listening, learning and improving, much like a child does.
Our old friend, Data Science, has come to the rescue once again; and more specifically, Deep Learning.
What is Deep Learning?
I've mentioned this mechanism in my writing before, so I think it's appropriate I give a brief overview of what Deep Learning is and what makes it so powerful for this kind of thing.
The first thing to make clear is that the term covers a range of techniques; there's no one way to do this. There are Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), Deep Neural Networks (DNNs), Long Short-Term Memory (LSTM) networks, and even Generative Adversarial Networks (GANs), which pit two neural networks against one another: one tries to fool the other into believing the outputs it has generated are real when placed alongside training data, while the other points out which outputs it spotted as fakes, so the generator can improve.
Irrespective of the technique used, they all have this in common: they are designed to detect which features have a profound effect on one another and on the outcome being modelled by analysing a data set as it evolves over time, and to use those features to make predictions about future behaviour, depending on what the model was designed to predict, always learning from the instances where it gets the answer wrong. That automated feedback-and-adjustment loop is what differentiates this family of processes from standard statistical models, which are static and unchanging.
In a sentence, it's Machine Learning, but without the need for a Data Scientist to define what features to feed into the model; it makes that judgement for itself and learns from its mistakes.
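To make that feedback loop concrete, here's a toy sketch in plain NumPy (my own illustration, not tied to any particular framework or to the systems discussed here): a lone perceptron that only adjusts its weights when it gets a training example wrong, and stops once there's nothing left to correct.

```python
# A toy "learn from your mistakes" loop: a single perceptron nudges its
# weights every time it misclassifies a training example.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))              # two made-up input features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # the "right answer" to learn

weights, bias, lr = np.zeros(2), 0.0, 0.1
for epoch in range(20):
    mistakes = 0
    for xi, target in zip(X, y):
        prediction = int(xi @ weights + bias > 0)
        error = target - prediction         # 0 if right, +/-1 if wrong
        if error != 0:
            weights += lr * error * xi      # adjust only when wrong
            bias += lr * error
            mistakes += 1
    if mistakes == 0:                       # nothing left to correct
        break
print(f"Stopped after epoch {epoch + 1} with weights {weights.round(2)}")
```

Deep networks do the same kind of correction, just with millions of weights and rather more sophisticated maths behind each nudge.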
A feature in this context is an attribute a data record has; a property. In scenarios where these attributes are labelled (such as CSVs, documents, JSON files, database tables), a supervised learning mechanism is your best approach.
But fair warning now, supervised learning is just the tip of the iceberg; one with enough Maths hiding beneath the surface to make the Titanic look like a toothpick.
Supervised, Semi-Supervised & Unsupervised Learning
I expect some of you may be thinking something like this:
"Oh great, he's going to drown us in equations…"
Or perhaps:
"Supervised learning? What, like teaching? In a classroom? Is this what people mean by training a model?"
Alternatively, it may be this:
"Yes, yes, we know what all this is you cretin, get on with it will you…"
Either way, I'll try and satisfy all camps. No, I'm not going to drown you in maths, as I want this to be an accessible read (as much as this sort of thing can be); and yes, this is kind of like training a model - you are training it to find features that are significant enough to feed into your Deep Learning model - but it's not the only way to do so.
For the "shut up and get on with it" camp, patience is a virtue - imagine your compiling some code or something and I'll give you a shout when I'm done. I'll try and keep this as short and to the point as possible.
Supervised
As I said, supervised learning is best used when all the attributes of a record have labels; you know what each value means and its wider context. To give an example, someone's height, age, place of birth or any other defining feature of their customer record could be selected as a relevant feature to include in the model, should the learning process deem it statistically significant.
If Age and Gender directly correlate with a longer Membership Term and higher Spend per Annum, then the Customer Lifetime Value score generated by the Deep Learning network could be higher. You may not have known this going in, but the system has found a correlation between features with these labels in the training datasets and has classified them as good features for the model. Over time, new labelled features may become more relevant and be introduced, or, indeed, old ones removed.
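To give a feel for what that correlation hunt looks like, here's a toy sketch with entirely made-up column names and fabricated data; a real network does this weighting internally over many iterations rather than via a pandas one-liner.

```python
# Which labelled attributes move together with the value we want to predict?
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000
customers = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "membership_years": rng.integers(0, 20, n),
})
# Fabricated target: spend driven mostly by membership term, plus noise.
customers["spend_per_annum"] = (
    200 + 45 * customers["membership_years"] + rng.normal(0, 60, n)
)

# Rank candidate features by absolute correlation with the target.
print(customers.corr()["spend_per_annum"].abs().sort_values(ascending=False))
# membership_years scores high, age scores near zero -> keep the former
# as a feature and discount the latter.
```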
A multi-layer perceptron (MLP) is particularly good for this as it can classify attributes with non-linear relationships - not just straight lines through graphs, but a variety of curves and gradients - when deciding which attributes relate to one another in direct comparison. These networks use weighting to give the perceptrons that correctly classify outputs against a test data set a larger share of the final aggregated output, thereby coming to a decision through trial and error over thousands and thousands of iterations in a very short space of time.
In the interest of keeping this "un-mathsy" I want to go back to my example earlier in the section. The classification could run like this:
1. Does age correlate strongly with higher spend per annum? > MLP checks against test data and judges no > relationship gets a low score
2. Does longer membership term correlate strongly with higher spend per annum? > MLP checks against test data and judges yes > relationship gets a very high score
3. The Deep Learning model selects both longer membership term and higher spend per annum as features for its model to predict Customer Lifetime Value, as both rank highly as having a relationship.
This is a very simple example, but one I hope illustrates the guts of how it works. The goal in this case is to classify which attributes correlate strongly with the output value and which do not, enabling the Deep Learning Network (DLN) to make the right choice of features autonomously, at speeds far beyond anything a human could manage.
You can play with multi-layer perceptron networks on this site - it's great for helping to understand how this works in reality - https://playground.tensorflow.org
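If you'd rather poke at one locally, here's a minimal supervised sketch using scikit-learn's MLPClassifier (my choice of library, not one the article mandates): a small multi-layer perceptron learning a curved decision boundary that no straight line could capture.

```python
# A small MLP learning a non-linear boundary from labelled data.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Two interleaving half-moons: labelled points with a curved boundary.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers of perceptrons; weights are adjusted over many
# iterations whenever the network gets a training point wrong.
mlp = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0)
mlp.fit(X_train, y_train)
print(f"Held-out accuracy: {mlp.score(X_test, y_test):.2f}")
```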
Unsupervised Learning
Unsupervised learning does much the same thing; however, the context of the data is completely unknown. This method is used to define features in free text, spoken conversations, videos, photographs or other loosely defined structures.
What do I mean by loosely defined? Well, take a face, for example. As humans we understand instinctively exactly what a face is, and react accordingly. That decision is made based on a whole group of attributes that individually have no context but together are obviously a face.
We learn what a face is over time, and refine that knowledge to the point that we can identify individual faces the more we are exposed to them. The visual cortex is a powerful beast, and it dominates our perception of the world, but if you had to describe every single facet that makes you certain a face is real, you would find it much harder to pin down.
This is why realistic mannequins give us the creeps; they're almost real but not quite right - we can't say exactly why in the moment, we just know. The facets that are off may differ from example to example, but that slight drift is enough; back into the uncanny valley we go!
Unsupervised learning tries to work the same way. Using a training set it builds up a view of which data points relate to each other strongly or weakly without knowledge of what each point is. The most common way this is done is by Clustering.
By building clusters, unsupervised systems can group these unlabelled attributes together, calculating the closeness of each relationship as a distance value. That distance could be calculated from the gap between two instances of a single attribute, or from multiple attributes combined into an aggregate. The smaller the number, the closer the records are, and the tighter they fit within the cluster.
Confused yet? It took me a little while to wrap my head around it when I first studied the subject; I'll try and make it real.
Imagine you've surveyed 100,000 people, and the answers from that survey were all rankings of one through ten. Each survey had some basic identity questions plus ten of these ranking answers.
However, in their haste to get the results, your survey team fed the list of questions asked to the office cat ‘Princess’, because some idiot spilt a tin of John West Tuna across its now briny pages. You only have the answers. Researchers eh?
After smashing your head into a wall for 30 minutes, chased with several gin & tonics, all whilst simultaneously regretting your hiring decisions and lamenting the peculiar lunch choices of those in your employ, you turn to the data. You've no idea what any of it means beyond the obvious demographic data - name, address, age range, gender, job title.
Because of this you decide to run a K-means cluster analysis. The rationale is simple: you've no idea how any of these answers relate to each other, as you have no context, so you can't perform any feature engineering or classify these attributes. A K-means clustering operation uses distance - how far each customer's values sit from everyone else's - to position each record, grouping customers with similar values into K clusters, each gathered around a mean value (its centre).
Fran may be close to Barry but far from Edna, who is in turn a bit closer to Dean, and so on.
This is the logical choice for detecting patterns in the results themselves, the hope being that it will segment the respondent base into discrete groups as well as highlight which of the unnamed features contributed most to the successful clustering operation.
You run your analysis and derive an aggregate distance value for each record in relation to every other. This tells you which records are statistically similar, and after a few passes: success! You have found two distinct groups. Without even knowing what these dimensions are for, you have determined how the respondents relate to one another statistically and segmented your list.
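Here's a rough re-enactment of that survey story as a scikit-learn sketch (the two hidden "camps" of respondents are fabricated so the clusters have something to find): ten anonymous one-to-ten ranking columns, no question text, and K-means asked for two groups purely on distance.

```python
# Unsupervised segmentation of anonymous survey rankings with K-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
# Fabricate two hidden "camps" with different answering habits.
camp_a = rng.integers(1, 6, size=(500, 10))    # tends to answer 1-5
camp_b = rng.integers(6, 11, size=(500, 10))   # tends to answer 6-10
answers = np.vstack([camp_a, camp_b])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=7).fit(answers)
print("Respondents per cluster:", np.bincount(kmeans.labels_))
# Each respondent lands in the cluster whose centre (mean) is closest;
# the distance to that centre is the "how well do I fit" score.
print("Cluster centres (rounded):\n", kmeans.cluster_centers_.round(1))
```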
A DLN will do much the same, but automatically, in parallel and at breakneck speed; it will then feed these data points into the next stage as features.
There is a plethora of different clustering flavours - DBSCAN, EM, GMM - each with its own use cases.
Those ranking numbers in the survey example could easily be words, sentences, descriptions, love letters, photos of the aforementioned cat tucking into its impromptu fish supper; the overall method would be nearly identical.
Data is, as data does.
Semi-Supervised Learning
Ok, so I'm hoping all the rubbish in the previous two sections made at least a modicum of sense; god knows it took me long enough to write it...
There is one more slice of joy in this pie of wonder, and I think by now many of you have probably had a guess at what the final piece of this pastry looks like.
Semi-supervised learning is, as expected, somewhere in the middle. Cluster analysis is good for this too, as you can absolutely run cluster analysis over structured/labelled datasets to pick out patterns you would have had no clue to even investigate, or run analysis across compound datasets made up of both structured and unstructured assets.
If this is starting to sound like a typical scenario for a large data driven org then I think you owe yourself another beverage; this type of initial feature analysis is perfect for those businesses with mixed estates of legacy structured databases, maybe some interim XML and JSON assets and streams of free text or machine code that defies obvious understanding or interpretation.
The reality is most applications of Deep Learning Networks use Semi-Supervised learning processes when making decisions about which features to feed into the DLN in the first place.
The world is imperfect and so is our data.
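As a flavour of how a few labels can go a long way, here's a minimal semi-supervised sketch using scikit-learn's LabelSpreading (one technique among many, and my own choice of example): only 20 of 400 records carry a label, and the algorithm spreads those labels to their statistically nearest neighbours.

```python
# Semi-supervised learning: spread a handful of known labels across
# unlabelled neighbours.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

X, y = make_moons(n_samples=400, noise=0.1, random_state=0)
y_partial = np.full_like(y, -1)          # -1 marks "label unknown"
known = np.random.default_rng(0).choice(len(y), size=20, replace=False)
y_partial[known] = y[known]              # only 20 of 400 points labelled

model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_partial)
print(f"Accuracy on the hidden labels: {model.score(X, y):.2f}")
```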
Features Done, What Now?
So, we have our features, somehow. What happens next? Well, that all depends on what the purpose of your model is; what are you trying to Predict? Understand? Achieve?
The business outcome will determine the design and nature of the network you create; as I said earlier, Deep Learning is a catch-all term covering a myriad of different model types and methods, and one size does not fit all. Your choice need not be restricted to one model though: multiple model types can be chained together or executed in parallel as ensembles, with the results merged or aggregated at the end.
For instance, CNNs are commonly used in image processing, whereas LSTMs are great for Natural Language Processing. If you need to interpret images to extract words and then process those words to pull out their meaning, then an ensemble of multiple model types makes complete sense.
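As a rough sketch of that kind of chaining (assuming TensorFlow/Keras, and bearing no resemblance to Duplex's actual architecture), here's an untrained toy pipeline where a small CNN turns each image in a sequence into a feature vector and an LSTM then reads the whole sequence:

```python
# Chaining two model types: a CNN "reader" feeding an LSTM "interpreter".
# Untrained weights - this only demonstrates how the pieces wire together.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Stage 1: a CNN that turns a 28x28 glyph image into a 32-dim feature vector.
cnn = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),        # one vector per glyph
])

# Stage 2: an LSTM that reads a sequence of glyph vectors and predicts
# a label for the whole "sentence" (two made-up classes).
lstm = models.Sequential([
    layers.Input(shape=(None, 32)),         # variable-length sequence
    layers.LSTM(64),
    layers.Dense(2, activation="softmax"),
])

def interpret(sequence_of_images):
    glyph_vectors = cnn(sequence_of_images)             # (seq_len, 32)
    return lstm(tf.expand_dims(glyph_vectors, axis=0))  # add batch dim

dummy_sequence = np.random.rand(5, 28, 28, 1).astype("float32")
print(interpret(dummy_sequence).numpy())    # probabilities for 2 classes
```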
Conveniently, that paragraph leads us back to the beginning, and how we are moving towards natural human centric communication with machines at breakneck speed.
Much of this would be impossible without advances in Deep Learning Network design, the maths to make it real, and a vast array of supporting libraries / frameworks such as TensorFlow, WordNet and spaCy.
But I think one inescapable driving force behind this shift in user experience is our own growing comfort with the medium. I remember, as recently as 5 or 6 years ago, being very much in the minority when using conversational interfaces - whether these were tapped into a window or spoken aloud.
Now, even my dad has an Amazon Echo.
There's a positive reinforcement loop in play; thanks to Deep Learning these interfaces become smarter the more we converse with them. In turn, we use them more as they understand us better, leading to further improvements in their accuracy and perpetuating the cycle of engagement, adoption and enhancement.
The lines are beginning to blur between what is considered ‘real’ and ‘artificial’ - it will be interesting to see how humanity adapts as technology becomes relatable, more ‘human’ in its interactions with us.
Turing's test is about to be put to the test by a hockey puck that wants you to have a haircut.
What a delightfully weird and wondrous time, indeed.