A note on Machine Learning, from a real noob...
Let me preface this by saying I'm just dipping my toes into the very deep waters of computer vision algorithms and machine learning, so please take all this with an unhealthy dose of salt! Still, outside (noob) perspectives are useful in their own right, as is sharing the learning process. So, deficiencies sufficiently bared, I wanted to share some thoughts about Machine Learning that have been eye-opening for me.
Machine Learning, like all great technology, is deceptively simple.
You JUST have to teach a computer how to learn! Think about that a bit, though... You don't have to teach kids how to learn; they just do. Walking, talking... we have to teach them those things by example. But we don't have to teach them how to learn. Our brains are massively parallel learning machines - they're hardwired to learn.
This is, of course, precisely why the people doing the really cutting-edge stuff with machine learning (Google, Facebook, Tesla, Nvidia, etc.) are making custom or semi-custom hardware to speed things up - and it is massively parallel hardware, like GPUs or your brain. But I digress... Point being, writing software for machine learning, and then writing "learning code" to make that software into a self-improvement machine, is really, really hard, even though those who do it well make it seem really, really simple on the surface. So, yeah. Bear that in mind.
Machine learning can be applied at different scales
What I mean is that one can go really deep and have the machine learning tools actually craft and refine their own algorithms to some extent. With this kind of "deep learning" the main boundary still imposed is the structure of the "artificial neural network" behind the scenes. This approach gets you stuff like two Facebook chat-bots inventing their own language to communicate more effectively with each other. Yes, that happened. This is where specialized hardware really comes into its own allowing cars to learn how to drive just like kids learn how to walk. Which is a really good analogy - except that a car running custom hardware taught itself to drive in like 8 hours... imagine your kid doing that for walking! Yikes...
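If it helps to see how small the core idea can be, here is a toy version of that "teach a computer to learn" trick: a tiny neural network, written with NumPy, that learns the XOR function purely from examples. The network structure (layer sizes, the squashing function, the learning rate) is the boundary we impose; the weights are what it figures out on its own. This is a minimal sketch for intuition only - nothing like the networks Tesla or Facebook actually run.

```python
import numpy as np

rng = np.random.default_rng(0)

# The imposed "structure": 2 inputs -> 4 hidden units -> 1 output.
# We choose the shape; the network learns the weights from examples.
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The "examples": the XOR function, which no single straight-line rule
# can capture, but a small network can learn.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

for step in range(10000):
    # Forward pass: run the inputs through the network.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: nudge every weight a little to shrink the error.
    grad_out = (out - y) * out * (1 - out)
    grad_h = (grad_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ grad_out
    b2 -= 0.5 * grad_out.sum(axis=0)
    W1 -= 0.5 * X.T @ grad_h
    b1 -= 0.5 * grad_h.sum(axis=0)

# After training, the outputs should approach [0, 1, 1, 0] - learned from
# examples, not from hand-written rules.
print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))
```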
In practical day-to-day applications, that kind of deep learning is far less common, though, as the specialized hardware is not cheap, and that level of software customization is even more expensive! It also requires a vastly larger amount of data to "train" the system than simpler approaches do. A simpler approach is to apply machine learning to fixed algorithms in an attempt to refine their thresholds. This is how most day-to-day machine learning is handled. Personalizing word prediction while you're texting, for instance, or Target knowing your teen daughter is pregnant before you do...
In these cases, the underlying algorithms (more on that in a bit) remain largely unchanged, but through machine learning the thresholds that flip the switch one way or another can be adjusted without direct human guidance to attain much higher accuracy than would likely be achieved by a person's relatively myopic stab at the answer. This still requires a lot of data of course, just not the petabytes that deep learning requires...
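As a completely made-up illustration of that shallower flavor, imagine a fixed rule that flags a message as urgent whenever some score crosses a cutoff. The rule itself never changes; the labeled history is only used to pick the cutoff. That's the whole trick at its most modest:

```python
# A fixed algorithm: flag a message as "urgent" if its score passes a cutoff.
# The algorithm never changes; only the cutoff is learned from labeled data.

def flag(score, cutoff):
    return score >= cutoff

# Hypothetical history: (score the algorithm produced, the correct answer).
history = [(0.12, False), (0.35, False), (0.41, True), (0.48, False),
           (0.55, True), (0.62, True), (0.71, True), (0.90, True)]

def accuracy(cutoff):
    hits = sum(flag(score, cutoff) == truth for score, truth in history)
    return hits / len(history)

# "Machine learning" at its most modest: try many cutoffs, keep the best one.
candidates = [i / 100 for i in range(101)]
best = max(candidates, key=accuracy)
print(best, accuracy(best))
```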
Data mining is required
Data is, of course, the key to enabling machine learning. Doing this for personalization is, at least, less fraught with privacy concerns. With things like text prediction, it can be done locally without transmitting your preference for misspelling the word "mispelling" for all your friends to mock. However, most broader uses (including creating the baseline for personalized prediction) require sharing that data with the world.
Now, I don't want to spin off into a privacy discussion, as that would take up the rest of the article. Suffice to say, there are ways to sanitize data (decouple it from your distinct identity) that go a long way toward ameliorating my concerns with sharing data. It may or may not do the same for you. :-) I think it is also good policy to always allow individuals to opt out of data sharing, though it is also fair that opting out may limit the functionality of the program in some way. If you aren't allowing the connection necessary for sending your data up, then that connection isn't open to receive either. That isn't globally true, nor is it due to technologically insurmountable issues, but it is fair.
Without sharing, so many of the conveniences of modern technology wouldn't be possible. So, I think the debate should be focused on what is the proper and legal way to collect and share data rather than whether we should share it at all - that ship pretty much sailed right after the internet did, y'all. And, after all, sharing is caring.
You need the right answer
Whether it is deep learning, or "shallow" learning you're employing, the training/teaching process requires someone who knows what the right answer is. While theoretically this could be another more developed machine learning algorithm, for at least a little while this is definitely a human-guided task. So, yay! We all will still have jobs in 20 years! We'll all just be teaching computers how to do what we used to do without them!
There are some great examples of this out there: reCAPTCHA using website security captchas to digitize the words from scans of old books that OCR failed on, while feeding those correct, human-guided answers back into the OCR system to improve its reliability; Tesla recording the human-guided driving of real people to help train its Autopilot system; or Target using the shopping history of known-pregnant women to tune the pattern-analysis algorithms that predict who should get pregnancy-related product coupons.
Once you have a lot of data and have a way to train the system using a source of right and wrong answers, it's time to start feeding back into your algorithms...
A brief brief on Algorithms
Let's take a moment to talk about algorithms. This is a word that sometimes puts people's backs up because it sounds like you might need a degree to understand it. You don't. An algorithm is just a defined set of steps and rules for solving a specific problem. A simple example - you have a dead end corridor and want to know if it passes code. The code states that a dead end corridor can't be longer than 20' from a fire exit. We can write an "algorithm" for that:
- Measure the length of the hallway
- If it is longer than 20' it fails
- If it is 20' or less, it passes
We just wrote an algorithm. Yes, it is a stupidly simple one. I have limitations, people. Of course, we also need a 5' turning radius at the end of the corridor to pass accessibility code, so let's add a test for that. We also need to make sure that doors in that corridor don't impede that turning radius. So, add a test for that... Pretty quickly you can have a lot of steps and a more complex algorithm to process. Also, not all algorithms have clear yes/no conditions. Many algorithms are based on confidence factors.
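Before moving on to confidence factors, here is that corridor algorithm written out as code, with the turning-radius and door checks bolted on. The 20' and 5' numbers come straight from the example above; everything else is just a sketch:

```python
MAX_DEAD_END_FT = 20.0
MIN_TURNING_RADIUS_FT = 5.0   # the 5' turning radius from the example

def corridor_passes(length_ft, end_clear_radius_ft, door_swings_clear):
    """Return (passes, reasons) for a straight dead-end corridor."""
    reasons = []
    if length_ft > MAX_DEAD_END_FT:
        reasons.append(f"dead end is {length_ft}' (limit is {MAX_DEAD_END_FT}')")
    if end_clear_radius_ft < MIN_TURNING_RADIUS_FT:
        reasons.append("turning radius at the end is too small")
    if not door_swings_clear:
        reasons.append("a door swing impedes the turning radius")
    return (len(reasons) == 0, reasons)

print(corridor_passes(18.0, 5.0, True))    # (True, [])
print(corridor_passes(24.0, 4.0, False))   # (False, [all three reasons])
```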
In the Target pregnancy kerfuffle, 25 products were identified as having a correlation to pregnancy, and even to certain stages of pregnancy. However, almost all of those products might also have been bought for general, non-reproduction-related purposes. The quantities weighed in, and purchasing several of those products did signal a high likelihood. Hidden in this algorithm are a lot of thresholds (there's a rough sketch of how such a score might be wired together right after this list):
- How much of each product is "a lot"? (buying the large size of a non-scented lotion, or buying multiple of them, or buying a large size three weeks in a row?)
- How many different flagged products must be purchased to signal significance? (If you buy three different products is that enough? Or, do you need to buy at least one pregnancy test AND prenatal vitamins, and then buy lotion in bulk?)
- How is each product weighted in the confidence metric? (Should pregnancy tests and branded prenatal vitamins count the same as non-scented lotion? Or less? Or are some products a "requirement" while others are used for confirmation or narrowing the due date?)
- At what point is the confidence metric sufficiently high to warrant sending out coupons or baby-shower registry information? (70%? 80%? 90%? You want to limit the occurrences of angry dads showing up to yell at the store managers after all!)
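Here is the rough sketch promised above: a toy confidence score with each of those thresholds exposed as a number you could turn up or down. To be clear, every weight and cutoff below is invented for illustration - none of it is anything Target actually used.

```python
# A made-up confidence score in the spirit of the Target example. Every
# number below is a tunable threshold, not a real value from any retailer.

PRODUCT_WEIGHTS = {            # how much each flagged product contributes
    "unscented_lotion": 0.10,
    "prenatal_vitamins": 0.35,
    "pregnancy_test": 0.40,
    "cotton_balls_bulk": 0.05,
}
QUANTITY_BONUS = 0.05          # extra weight per repeat purchase of an item
MIN_FLAGGED_PRODUCTS = 3       # how many different products count as a signal
SEND_COUPONS_ABOVE = 0.80      # confidence needed before mailing anything

def confidence(purchases):
    """purchases: dict of product -> number of times bought."""
    flagged = [p for p in purchases if p in PRODUCT_WEIGHTS and purchases[p] > 0]
    if len(flagged) < MIN_FLAGGED_PRODUCTS:
        return 0.0
    score = sum(PRODUCT_WEIGHTS[p] + QUANTITY_BONUS * (purchases[p] - 1)
                for p in flagged)
    return min(score, 1.0)

cart = {"unscented_lotion": 3, "prenatal_vitamins": 1, "pregnancy_test": 1}
score = confidence(cart)
print(score, "-> send coupons" if score >= SEND_COUPONS_ABOVE else "-> do nothing")
```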
Those are the kinds of thresholds that would have been adjusted by a human in the past, using intuition, a moderate capacity for pattern recognition, and a quite limited capacity for data absorption and retention. Another way of thinking of these things is as tuning pegs on a string instrument. To make beautiful music, all the strings need to be tuned right. If the tunings are simple and there aren't a ton of strings, people can be pretty good at it (like a guitar). Tuning a piano is a much harder thing to do. For algorithms with inter-related thresholds, imagine trying to tune a piano where tightening one string loosens another one somewhere else... Yikes! Call in the computers. These are the kinds of things that machine learning, even on a normal computer, can be really good at refining. So...
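Here is what "letting the computer turn the pegs" can look like at its most pedestrian: a brute-force grid search over two interacting thresholds (how many flagged products are required, and how confident you must be before acting), scored against a small set of hypothetical right answers. Real systems use smarter optimizers and far more data, but the shape of the idea is the same.

```python
from itertools import product

# Two interacting "tuning pegs": how many flagged products are required, and
# how high the confidence must be before acting. Tightening one changes the
# best setting for the other, so we search them together against labeled data.

# Hypothetical history: (number of flagged products, confidence score, truth).
history = [(1, 0.30, False), (2, 0.55, False), (3, 0.60, False),
           (3, 0.85, True),  (4, 0.70, True),  (4, 0.90, True),
           (2, 0.95, True),  (5, 0.65, True),  (1, 0.80, False)]

def predict(n_products, score, min_products, cutoff):
    return n_products >= min_products and score >= cutoff

def accuracy(min_products, cutoff):
    hits = sum(predict(n, s, min_products, cutoff) == truth
               for n, s, truth in history)
    return hits / len(history)

grid = product(range(1, 6), [c / 20 for c in range(21)])
best = max(grid, key=lambda peg: accuracy(*peg))
print(best, accuracy(*best))
```

Swap in real data and something smarter than brute force, and you have the day-to-day flavor of machine learning described earlier.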
Back to Machine Learning
With all that in mind, I've been working on some different rules of thumb to help guide me in thinking about how to apply machine learning to the software products I'm responsible for. Our tools are heavily laden with some very complex computer vision algorithms with a lot of strings to tune, so it would be pretty helpful!
What and how much data do I need?
This question has to be answered with a specific optimization of an algorithm in mind. You'll need the refined inputs (what feeds into the algorithm) and you'll need the correct answer it should come up with. And you'll need a lot of those. A whole lot. Not like 10, or even 1,000. Like 100,000. Or 100,000,000. This is why machine learning isn't just thrown at any old problem, and why most algorithms start off being human-tuned. You need a certain baseline amount of information encompassing most of the conditions you want to test for. Now, you may be fine with certain outliers failing - no algorithm is perfect (100% reliable). But you'll likely have a target reliability for your algorithm, and that will definitely give you a clue as to how much data you'll need...
Now, you'll probably notice that the reliability axis starts at 50%. If your algorithm is wrong more often than it is right... you probably need to change the algorithm before you mess with tuning it through machine learning. At <50% you'd statistically be better off flipping a coin. To really nail it on reliability you'll need exponentially more data than you will to just do it pretty well. I don't have any numbers up there because, well, this is a relationship chart. And, the amount of data you need is actually dependent upon another critical factor...
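You can get a feel for that data-versus-reliability relationship with a toy experiment: tune the same cutoff on bigger and bigger slices of synthetic data, then check how it holds up on examples it has never seen. The data below is randomly generated with 10% noisy labels, so no amount of data ever reaches 100%.

```python
import random

random.seed(1)

# Synthetic labeled data: a score in [0, 1], with the "true" rule being
# score >= 0.6, plus some noise so no cutoff can ever be perfect.
def make_examples(n):
    out = []
    for _ in range(n):
        score = random.random()
        truth = score >= 0.6
        if random.random() < 0.1:      # 10% noisy labels
            truth = not truth
        out.append((score, truth))
    return out

def best_cutoff(examples):
    def acc(cutoff):
        return sum((s >= cutoff) == t for s, t in examples) / len(examples)
    return max((i / 100 for i in range(101)), key=acc)

holdout = make_examples(5000)
for n in (10, 100, 1000, 10000):
    cutoff = best_cutoff(make_examples(n))
    reliability = sum((s >= cutoff) == t for s, t in holdout) / len(holdout)
    print(f"trained on {n:>6} examples -> cutoff {cutoff:.2f}, "
          f"holdout reliability {reliability:.3f}")
```

Typically the jump from 10 examples to 100 buys a lot, while the jump from 1,000 to 10,000 buys very little - diminishing returns in miniature. Which brings us to that other critical factor...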
Not too surprisingly, complex algorithms with lots of tunings, variables, or fringe conditions are going to require a lot more data - or you'll have to intentionally choose to ignore certain conditions in the interests of simplifying your algorithm and your life. In our corridor example, hallways aren't always straight. They can jog, or be at an angle, or be curved, etc... So, you can either try and write a more complex algorithm that encapsulates those conditions by defining a path of travel and measuring the length of that curve OR you can just say - only use my fancy compliance tool on straight hallways.
If you do the former, you'll need a lot more data so you can test and refine your algorithm over each of those conditions. If you do the latter, then you'll need a lot less data for the tuning process - but people might not find your tool as useful. Making that decision on a real-world software product is a complex balancing act for the developers, product managers, and customers using the tool. Part of the job though. :-)
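For what it's worth, the "more complex algorithm" option from the corridor example isn't necessarily exotic. Here is a sketch of the path-of-travel idea: represent the egress path as a polyline of (x, y) points in feet and measure along it, instead of assuming a single straight segment. The coordinates are made up.

```python
from math import hypot

# The "handle the jogs and angles" option: measure along a path of travel
# (a polyline of points in feet) rather than assuming one straight segment.

def path_length(points):
    return sum(hypot(x2 - x1, y2 - y1)
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

MAX_DEAD_END_FT = 20.0

dog_leg = [(0, 0), (12, 0), (12, 6), (18, 6)]   # a corridor with a jog
print(path_length(dog_leg), path_length(dog_leg) <= MAX_DEAD_END_FT)
```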
Looking at those two charts though, it is pretty obvious why Tesla or Google need hundreds of millions of data points to feed into their very complex algorithms for self-driving cars that must be extremely reliable. You've got to be more reliable than a human driver, and that means getting way up there into the 99.999% range to keep the lawyers and the insurance providers happy.
How to get started on that road...
One of the first things I heard from my colleagues at ClearEdge3D was this little phrase:
The hallmark of good AI is that it gets out of the way and makes it easy for a human to intercede when the computer fails...
This is where Microsoft failed with our dear friend Clippy. It was annoying because it just didn't realize when it was making it harder to work rather than easier. Getting it to shut up was a lost cause, so pretty much everyone just disabled it. With automation algorithms the same principle is true. If you're starting with a human-tuned algorithm, you've got to make it easy for the operator to intervene when your algorithm goes wrong. If you do this well, though, you're in a beautiful position to collect data on what the right answer actually is - just like reCAPTCHA did.
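A minimal sketch of what that looks like in code: show the operator the automated answer, get out of the way if it's right, and quietly log the corrected answer when it's wrong so it can feed a later round of tuning. All the names and the CSV log here are invented for illustration.

```python
import csv

# Sketch of the feedback loop: show the automated answer, let the operator
# accept or correct it, and log (input, machine answer, human answer) so the
# corrections can later be fed back into tuning.

def automated_guess(measurement):
    return measurement <= 20.0          # stand-in for the real algorithm

def review(measurement, log_path="corrections.csv"):
    guess = automated_guess(measurement)
    answer = input(f"Machine says {'PASS' if guess else 'FAIL'} "
                   f"for {measurement}'. Press Enter to accept, "
                   f"or type the correct result (pass/fail): ").strip().lower()
    final = guess if answer == "" else (answer == "pass")
    with open(log_path, "a", newline="") as f:
        csv.writer(f).writerow([measurement, guess, final])
    return final
```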
At that point you still have to work out a fair, legal, and equitable way to collect that data, then figure out how to set up the machine learning system and hook it into your tuning pegs for your automation algorithm, and then hope you have the computing power to get a result in the time necessary to make it useful! And while details and execution are very important, knowing what you need and laying the foundation to get it is at least half the battle. And, if you can really fire on all cylinders and get everything connected right - you can make substantial headway into near-human or even better than human reliability with the problem your software is trying to solve or automate. And that's pretty darn cool.
(For any ClearEdge3D customers reading this article: we are not currently collecting any usage or product data in any of our software products, nor do we have a mechanism to do so. If we ever do go down that road with any of our new or existing software, there will be a terms-of-use change notifying you of that intent, and there will always be an opt-out allowing you to keep your data 100% private - that's just good practice, after all.)