What I learned this week - supervised and unsupervised learning

Once you've prepared and cleansed your data, most data science modelling techniques fall into two categories: supervised techniques and unsupervised techniques. They can look wildly different, suited to different kinds of problems and very different ways of solving them. I want to understand them well enough that I can look at a business problem and have a good idea whether, as a data problem, it is supervised or unsupervised.

What is this trying to solve?

What is a supervised or unsupervised technique? It's better to think of the category of problem. Supervised is really clear and similar to the kind of problems you're used to solving with computers: "I already know what good looks like, and I want the computer to spot that for me." We might want to know "is this customer high value or low value?" or "how long is this person going to stay a customer of mine?" They're definable outcomes: the first based on an amount spent and a threshold we set, the second a number of months.
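To make that concrete, here's a minimal sketch in Python (using scikit-learn, with made-up customer numbers). The key point is that the labels already exist in the historical data, because we set a spend threshold ourselves, and the model's only job is to learn to reproduce them for new customers.

```python
# A minimal supervised sketch, assuming hypothetical customer data:
# we already know the "answer" (high value or not) for past customers,
# and we train a model to spot it in new ones.
from sklearn.linear_model import LogisticRegression

# Historical customers: [monthly_visits, avg_basket_size]
X_train = [[12, 85.0], [2, 14.5], [9, 60.0], [1, 8.0], [15, 120.0], [3, 20.0]]
# Labels we set ourselves with a spend threshold: 1 = high value, 0 = low value
y_train = [1, 0, 1, 0, 1, 0]

model = LogisticRegression()
model.fit(X_train, y_train)          # "supervised": the labels guide the fit

# Predict for a new customer we have no label for yet
print(model.predict([[10, 70.0]]))   # e.g. [1] -> likely high value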

Unsupervised problems don't already have a "good" condition. Or really any end conditions that we know in advance at all. I just want the computer to look at the data and tell me what it found. "What product is best fit for this person?" for example. We haven't told it how the products match to people, just sent it on its merry way.

It's easier said than done to know what good looks like. The existing data has to have a metric in it that we can use for that: something that lets us label some outcomes as "yes" and others as "no", or tell the model that higher is better or lower is better. There needs to be some way that the historical data measures success, so that we can train a model to predict future outcomes.

Last week I discussed regression, and I think that's a great example of a supervised technique. We calculate the relationship between the dependent variable and the independent variables, so that for each hypothetical future state we can say: "that's what the dependent variable will probably do." We have a numerical target, and we have "supervised" the model so that it lets us predict that target variable.
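As a quick sketch of that, here's a toy regression in Python. The numbers are invented, but the shape of the problem is the same: known inputs, a known numerical target, and a prediction for an unseen input.

```python
# A toy regression sketch (hypothetical numbers): fit the relationship
# between an independent variable and a numerical target, then predict
# the target for an unseen value.
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4], [5]]        # e.g. months as a customer
y = [10, 22, 29, 41, 50]             # e.g. total spend so far

reg = LinearRegression().fit(X, y)
print(reg.predict([[6]]))            # what spend to expect at month six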

But something we do a lot at Profusion that fits as unsupervised is clustering, like bottom-up persona generation. Instead of building a persona first and then finding the customers who fit, clustering lays out every customer on a big sheet of paper, positioned according to their datapoints (age, gender, location, purchase history, etc.). Then it works out where customers are grouped, and tells you: "These are the best, tightest groups that I could find, and the typical datapoints defining each group."
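Here's a minimal sketch of that idea using k-means clustering in Python (the customer datapoints are made up). Notice that we never supply labels: we only ask for a number of groups, and then read back what it found.

```python
# A minimal unsupervised sketch with k-means, on made-up customer data.
# No labels anywhere: we ask for the tightest groups and read the
# cluster centres back as "typical" personas.
import numpy as np
from sklearn.cluster import KMeans

# Columns: [age, purchases_per_year]
customers = np.array([
    [22, 40], [25, 38], [24, 45],    # young, frequent buyers
    [51, 5],  [48, 8],  [55, 3],     # older, occasional buyers
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)                # which group each customer fell into
print(kmeans.cluster_centers_)       # the typical datapoints per group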

A note about neural networks

A lot of the time when we talk about machine learning, we're actually talking about a neural network. They're really good at some kinds of supervised learning, because we can feed in input data and score the network on how well it matches the known output data, again and again, until it's highly accurate at matching those outputs for new data too.

Picture a series of miniature decision-making nodes. Each one, on its own, is unsupervised. The node is basically just a little piece of (complex) code that says "give me access to all the data, and I'll give you an output."


Then those outputs are passed on to more nodes as inputs, and the new nodes say "cool, I'll give you an output too." The outputs can be kind of meaningless, not necessarily connected to the inputs in any way that would make sense to us. At this stage it doesn't really matter whether they're connected to the inputs at all.

But then, and this is where the supervised learning bit comes in, you tell the whole network of nodes "I want a single output. From input A, I want output X." It won't work. So you get it to change something in the nodes. It doesn't matter what, but it'll be pre-programmed to tweak things a little bit at random. You try again. And again.

Do this a million times. Each time it gets closer, keep the thing that got it closer and change something else. Ditch what hasn't worked, stick with what has.
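Here's a deliberately crude toy sketch of that loop in Python, on a single made-up "node" with one weight. Real networks use much smarter update rules (gradients rather than blind tweaks), but the keep-what-helps logic is the same.

```python
# A crude sketch of "tweak at random, keep what got closer", on a toy
# one-weight model. Everything here is invented for illustration; real
# training is far smarter than blind random tweaks.
import random

def predict(weight, x):
    return weight * x

def loss(weight):
    # We want input 2.0 (our "Input A") to give output 10.0 ("Output X")
    return abs(predict(weight, 2.0) - 10.0)

weight = random.uniform(-1, 1)
for _ in range(1_000_000):
    tweaked = weight + random.uniform(-0.01, 0.01)   # change something at random
    if loss(tweaked) < loss(weight):                 # keep it only if it got closer
        weight = tweaked                             # ditch it otherwise

print(weight, predict(weight, 2.0))   # weight ends up near 5.0, output near 10.0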

It's an ugly brute-force method, but really smart programming can reduce that: it can build shortcuts, and favour the kinds of changes that typically help. It also helps to have a specific kind of network, where the nodes have been programmed by someone who understands this kind of challenge.

Eventually you've created something that can reliably get to Output X when you give it Input A. But that's only part of the problem, because next you test it with Input B, hoping for Output "Not X".

So you do it a million more times. So that A gives you X, and B gives you Not X.

Finally you will have trained a neural net that can classify inputs A or B as X or Not X. That's a classification problem, and we've solved it using a supervised neural network.

How does that solve the problem?

This allows much more complex inputs than our regression analysis, because the input could be any range of things, not just numbers or dummy variables. It might be a JPEG file. A JPEG is a coded map of the colour and brightness of each pixel, but that kind of map doesn't mean much to a computer: it can display it, but not understand it. You'd have to program in "here are all the ways to detect an edge; here is the difference between a new object and a shadow." A neural network doesn't care. It just sees that input code and runs it through the nodes until it gets you the output you want.
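As a small illustration (assuming the Pillow and numpy libraries, and a hypothetical local file called photo.jpg): once decoded, an image is nothing but a grid of numbers, which is exactly the kind of input a network is happy to consume.

```python
# A sketch of why image input "just works" as data. 'photo.jpg' is a
# hypothetical local file; the decoded image is only a grid of pixel
# numbers, with no hand-written rules about edges or shadows needed.
import numpy as np
from PIL import Image

pixels = np.asarray(Image.open("photo.jpg"))  # shape: (height, width, 3 colour channels)
features = pixels.flatten()                   # one long row of numbers, ready to feed in
print(pixels.shape, features.shape)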

So now you can interpret or classify some really complicated things!

What doesn't it handle well?

Obviously neural networks aren't perfect, or the best solution to every problem. The way they interpret information is totally alien, not based on anything you'd recognise. Data scientists say you can't interpret how a neural network actually makes its decisions; that you sacrifice interpretability for accuracy. But why?

It's because each node on its own might offer nothing that gets you closer to the answer. You've brute-forced your way into making the right decisions for the wrong reasons. The first node might say "If A and B are above 10, but C minus D is less than A squared, the output is a 1. Otherwise the output is ASCII art of a shovel." It's kind of random. The next node might say "Ah, he's given me Shovel again. That means I need to write a letter to the Chancellor of the Exchequer about pigs, and send that to the next node!" None of it will make sense.

Since we only supervised the network as a whole to give us the answer we wanted, each node was sort of unsupervised. No node is explicit about how it contributes to getting us to the answer; we just know that, when they're put together, they do.

Can you make this easier?

Neural networks are a form of machine learning, but the category of machine learning is much larger. There are many other ways to produce systems that will, in supervised or unsupervised manners, tell us more about our data.

A lot of work has already been done on how to build neural networks or other machine learning tools. So typically when you're building machine learning models, you use those tools that already exist. You might need to train your network, but hopefully you never have to program the nodes. If you're lucky, you can use pre-trained models that already know what good looks like and the right relationships to get there. You're just adding new input data to make good predictions.
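For instance, here's a sketch using scikit-learn's ready-made MLPClassifier on invented toy data. The nodes, the tweaking, and the whole training loop are handled by the library; all we supply is the labelled data.

```python
# A sketch of leaning on existing tooling: MLPClassifier is a ready-made
# neural network, so no node programming is required. Data is made up.
from sklearn.neural_network import MLPClassifier

X = [[0.0], [0.2], [0.8], [1.0]]   # tiny toy inputs
y = [0, 0, 1, 1]                   # the outputs we want it to learn

net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=1)
net.fit(X, y)                      # training: the library runs the whole loop
print(net.predict([[0.9]]))        # should come back as [1]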

What next?

In future entries I expect to talk more about machine learning, but looking into neural networks definitely helped me to clear up the difference between supervised and unsupervised learning in my own mind, and what kind of problem they're suited to, without having to memorise a list of problem types.
