Machine Learning: Classification Models
These days the terms “AI”, “Machine Learning”, and “Deep Learning” are thrown around by companies in every industry; they’re the type of words that make any forward-looking executive salivate. You might think these are new concepts that appeared overnight, but the reality is they’ve been around for a while, and it’s the hard work of many within the field that has moved them into the spotlight as the latest tech trend. While these terms are sometimes used interchangeably by the media, they are certainly not the same, but I’ll leave that discussion for another time.
It’s surely an exciting time for the industry: a slew of open source libraries (TensorFlow, PredictionIO, DeepLearning4J, or see GitHub) is coming into popularity, and cloud providers such as Amazon, IBM, and Microsoft (the list goes on) all offer their own tools to help you get started in the AI/ML/DL field.
If you’ve stumbled on this article, you’re probably well aware of everything I’ve mentioned above, so now that we’ve gotten past the obligatory intro, let's get to what the title actually claims this article is about.
So what are classification models?
A classification model attempts to draw some conclusion from observed values. Given one or more inputs, a classification model will try to predict the value of one or more outcomes. Outcomes are labels that can be applied to a dataset: for example, “spam” or “not spam” when filtering emails (the latter is also known as “ham”; seriously, look it up if you don’t believe me), or “fraudulent” or “authorized” when looking at transaction data.
There are two main approaches to machine learning, supervised and unsupervised. In a supervised model, a labeled training dataset is fed into the classification algorithm. That lets the model learn what, for example, an “authorized” transaction looks like. New data is then compared against what the model has learned to determine whether a transaction is “fraudulent”. This type of learning falls under “Classification”.
Unsupervised models, on the other hand, are fed a dataset that is not labeled and look for clusters of data points. They can be used to search data for similarities, detect patterns, or identify outliers within a dataset. A typical use case would be finding similar images. Unsupervised models can also be used to find “fraudulent” transactions by looking for anomalies within a dataset. This type of learning falls under “Clustering”.
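As a rough illustration of the supervised idea, here is a minimal sketch that "learns" from labeled transactions and labels a new one. It uses a 1-nearest-neighbor rule, which is a stand-in chosen for brevity and not one of the models discussed in this article. The transaction features (amount, hour of day) and all the values are made up for the example.

```python
import math

# Hypothetical labeled training data: (amount in dollars, hour of day) -> label.
training_data = [
    ((12.50, 13), "authorized"),
    ((40.00, 18), "authorized"),
    ((25.00, 9), "authorized"),
    ((950.00, 3), "fraudulent"),
    ((1200.00, 4), "fraudulent"),
]

def predict(features, examples):
    """Return the label of the training example closest to `features`."""
    nearest = min(examples, key=lambda example: math.dist(example[0], features))
    return nearest[1]

print(predict((30.00, 12), training_data))    # authorized
print(predict((1000.00, 2), training_data))   # fraudulent
```

The point is only the workflow: labeled examples go in first, and new data is labeled by comparison with them.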
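To make the anomaly idea concrete, here is a small sketch that flags outliers in unlabeled data. It uses a simple z-score rule (distance from the mean measured in standard deviations), which is an assumption made for this example and not a full clustering algorithm. The amounts and the threshold of 2.0 are hypothetical.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical, unlabeled transaction amounts.
amounts = [20.0, 25.0, 22.0, 19.0, 24.0, 21.0, 23.0, 500.0]
print(find_outliers(amounts))  # [500.0]
```

No labels are involved here: the 500.0 transaction is flagged only because it sits far away from everything else.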
Spam filtering is a classic use of the Naive Bayes classification algorithm. When people get junk mail and mark it as spam, the words in that email get put into a database called spam. Good mail goes into the ham (aka not spam) database. Over time the list of spam words and phrases gets built up. Then the anti-spam algorithm can calculate the probability of an email being spam or not spam and make its determination based on that.
There are a number of classification models, including logistic regression, decision tree, random forest, gradient-boosted tree, multilayer perceptron, one-vs-rest, and Naive Bayes.
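The word-database idea can be sketched in a few lines. This is a toy version under stated assumptions: the four training messages are invented, the function names are hypothetical, words are split on whitespace only, and add-one smoothing is used so that unseen words do not zero out a score. A real filter would be considerably more careful.

```python
import math
from collections import Counter

def train(messages):
    """Build per-label word counts from a list of (text, label) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the label with the higher (log) probability for `text`.

    Assumes both labels appeared at least once during training.
    """
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    total_messages = sum(label_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        # Start from the share of all mail carrying this label.
        score = math.log(label_counts[label] / total_messages)
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from producing log(0).
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical messages already marked by users.
marked = [
    ("win the lottery now", "spam"),
    ("claim your free lottery prize", "spam"),
    ("lunch meeting at noon", "ham"),
    ("project status report attached", "ham"),
]
word_counts, label_counts = train(marked)
print(classify("free lottery prize", word_counts, label_counts))             # spam
print(classify("status report for the meeting", word_counts, label_counts))  # ham
```

Logarithms are summed instead of multiplying raw probabilities, which avoids very small numbers on long messages.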
Let’s look from a high level at some of these.
Logistic Regression
We already spent some time going over Logistic Regression using Apache Spark and Python. It takes some inputs and calculates the probability of some outcome. For example, if a child has a temperature of 104F (40C) and they have a rash and nausea then the probability that they have chickenpox might be 80%. A rule of thumb in logistic regression is that if the probability is greater than 50%, the decision is true. So in this case, the determination is made that the child has chickenpox.
This is a variation of linear regression, where a model is made to calculate some dependent variable, y, based on some independent variable, x, so that y = mx + b. The model looks for the coefficient m and the y-intercept b. Logistic regression fits the same kind of weighted sum, then passes it through the logistic (sigmoid) function so that the result lands between 0 and 1 and can be read as a probability. So you end up with a model where the probability of a child having chickenpox could be something like:
z = 0.01 * (temperature) + 0.04 * (nausea or not) + 0.03 * (rash or not) - 0.4
p(chickenpox) = 1 / (1 + e^(-z))
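Here is a minimal sketch of how such a model turns inputs into a decision. It plugs the illustrative coefficients from the formula above into the logistic (sigmoid) function, which is the step that squeezes the weighted sum into a probability between 0 and 1. The coefficients are not fitted to any data, so the result should not be expected to match the 80% figure mentioned earlier; the function names are hypothetical.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def chickenpox_probability(temperature_f, has_nausea, has_rash):
    """Weighted sum of the inputs, passed through the sigmoid."""
    # Illustrative coefficients from the text, not values learned from data.
    z = 0.01 * temperature_f + 0.04 * has_nausea + 0.03 * has_rash - 0.4
    return sigmoid(z)

p = chickenpox_probability(104, has_nausea=True, has_rash=True)
print(round(p, 2))
print("chickenpox" if p > 0.5 else "not chickenpox")
```

In a real model the coefficients would be learned from labeled training data, for example with the Spark tooling mentioned above.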
Decision Trees
A decision tree is a mechanical way to make a decision by dividing the inputs into smaller decisions. Like other models, it involves mathematics. But it's not very complicated mathematics.
Here is a graphic from the book “Machine Learning” by Tom Mitchell. The goal is to make a decision on whether to play golf based on the combination of temperature, humidity, the wind, and whether it is sunny, cloudy, or raining. For example, if it is sunny, mild, not humid, and not windy then the decision to play golf is “yes.”
The tree is divided into decision nodes and leaves. The leaves are the decisions: yes or no. The nodes are the factors: windy, sunny, etc.
The approach is to look at the decisions and the factors that led to each decision. It is based on the concept of entropy, which looks at the frequency distribution of decisions and then calculates a logarithm. For example, the complete matrix of factors that leads to the decision to play golf is fed into a table. In the Tom Mitchell example, the decision to play is yes 9 times and no 5 times. The frequency distribution of yes then is
y = 9 / (9 + 5) = 0.64
and no is
n = 5 / (9 + 5) = 0.36
Notice that y + n = 1. Then the entropy is calculated, using base-2 logarithms:
e = (-y * log2(y)) + (-n * log2(n)) = 0.94
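The entropy calculation is easy to check in code. This sketch takes the counts of each answer (the 14 examples split 9 to 5) and uses the base-2 logarithm, which is the base that produces the 0.94 figure. The function name is just for illustration.

```python
import math

def entropy(counts):
    """Entropy, in bits, of a list of class counts."""
    total = sum(counts)
    result = 0.0
    for count in counts:
        if count:  # an empty class contributes nothing
            p = count / total
            result -= p * math.log2(p)
    return result

print(round(entropy([9, 5]), 2))  # 0.94
print(entropy([14, 0]))           # 0.0, every answer is the same
```

An even split such as `entropy([7, 7])` gives 1.0, the most uncertain case for two possible answers.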
When entropy is zero, all the answers are the same. The process repeats itself, dividing the data into sub-conditions at each step until entropy is zero. So the next step in the golf example is to look at the decision to play golf when it is sunny. Then look at the decision when it is sunny and windy. And so forth.
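One common way to decide which factor to divide on next is to measure how much each candidate split lowers the entropy. This measure is usually called information gain; the text above does not name it, so treat it as an assumption about how the splitting is done. The per-branch counts below are illustrative pairs of class counts that add up to the same 14 examples.

```python
import math

def entropy(counts):
    """Entropy, in bits, of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(parent_counts, branches):
    """Entropy before the split minus the size-weighted entropy after it."""
    total = sum(parent_counts)
    after = sum(sum(branch) / total * entropy(branch) for branch in branches)
    return entropy(parent_counts) - after

parent = [9, 5]                      # all 14 examples
by_sky = [[2, 3], [4, 0], [3, 2]]    # sunny, cloudy, raining (illustrative counts)
by_wind = [[6, 2], [3, 3]]           # not windy, windy (illustrative counts)

gain_sky = information_gain(parent, by_sky)
gain_wind = information_gain(parent, by_wind)
print(round(gain_sky, 3), round(gain_wind, 3))
print("split on sky first" if gain_sky > gain_wind else "split on wind first")
```

The cloudy branch already has entropy zero, so it becomes a leaf; the other branches would be split again in the same way.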
Random Forest
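This approach to classification is similar to the decision tree, except that it builds many trees and includes some randomness. Each tree is trained on a random sample of the training data, drawn with replacement, and typically considers only a random subset of the factors when it poses its questions. The final answer is the majority vote of all the trees. The goal is to keep the quirks of any single tree from skewing the outcome. Training on these random samples is called bagging (short for bootstrap aggregating).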
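The following is a deliberately simplified sketch of the idea, not a production random forest. Each "tree" is only a one-level decision stump, trained on a random resample of the data and restricted to one randomly chosen feature, and the forest answers by majority vote. The transaction data, function names, tree count, and seed are all assumptions made for the example.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def train_stump(sample, feature):
    """Pick the threshold on one feature that misclassifies the fewest rows."""
    best = None
    for threshold in sorted({x[feature] for x, _ in sample}):
        left = [label for x, label in sample if x[feature] <= threshold]
        right = [label for x, label in sample if x[feature] > threshold]
        left_label = majority(left)
        right_label = majority(right) if right else left_label
        errors = (sum(lab != left_label for lab in left)
                  + sum(lab != right_label for lab in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return feature, threshold, left_label, right_label

def train_forest(data, n_trees=25, seed=0):
    """Train stumps on random resamples, each using one random feature."""
    rng = random.Random(seed)
    n_features = len(data[0][0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data]  # draw rows with replacement
        feature = rng.randrange(n_features)        # random feature for this stump
        forest.append(train_stump(sample, feature))
    return forest

def predict(forest, x):
    """Majority vote over all stumps."""
    votes = [left if x[f] <= t else right for f, t, left, right in forest]
    return majority(votes)

# Hypothetical transactions: (amount in dollars, hour of day) -> label.
data = [
    ((12, 13), "authorized"), ((40, 18), "authorized"), ((25, 9), "authorized"),
    ((18, 15), "authorized"), ((33, 11), "authorized"),
    ((950, 3), "fraudulent"), ((1200, 4), "fraudulent"), ((800, 2), "fraudulent"),
    ((1500, 1), "fraudulent"), ((1100, 5), "fraudulent"),
]
forest = train_forest(data)
print(predict(forest, (10, 14)))
print(predict(forest, (2000, 1)))
```

A real implementation grows full trees and picks a fresh random subset of features at every split; the stump version just keeps the resampling and voting steps visible.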
A commonly cited example application of this kind of model is recommending movies, as Netflix does: the model looks at people who have similar tastes and then recommends movies that way. Because many randomized trees vote on the answer, outlier answers are less likely to skew the response in an incorrect direction.
Naive Bayes
As mentioned above, Naive Bayes is used, among other cases, to classify email as spam or not. It is based on the concept of conditional probability. There is nothing fancy about Bayes, as this is just regular statistics.
If you roll a die, the probability of getting a 6 is 1/6 since there are 6 sides. What is the probability that you get a 5 after rolling a 6? It is still 1/6, the same as if the first roll had been a 3 or any other number, because the two rolls are completely independent events. So the probability of rolling a 6 and then a 5 is (1/6)*(1/6) = 1/36.
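The dice arithmetic can be verified by listing every possible pair of rolls. This is only a sanity check of the numbers above; the variable names are arbitrary.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first roll, second roll) pairs.
rolls = list(product(range(1, 7), repeat=2))

# Probability of a 6 followed by a 5.
six_then_five = [r for r in rolls if r == (6, 5)]
p_six_then_five = Fraction(len(six_then_five), len(rolls))

# Probability of a 5 on the second roll, looking only at pairs that start with a 6.
first_is_six = [r for r in rolls if r[0] == 6]
p_five_given_six = Fraction(sum(1 for r in first_is_six if r[1] == 5), len(first_is_six))

print(p_six_then_five)   # 1/36
print(p_five_given_six)  # 1/6, unchanged by the first roll
```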
Conditional probability is the chance of some outcome given some other outcome. The chance that an email is spam given that it contains a particular word is based on these three things:
- How many of the messages you receive over some period of time are spam?
- How many of those spam messages contain some word associated with spam, like “lottery”?
- How many of all the emails you receive, spam or not, contain the word “lottery”?
So we want to answer the question: “what is the probability that a newly received email is spam given that it contains the word ‘lottery’?”
That is calculated as ((probability of an email containing “lottery” given that it is spam) * (probability of spam for all mail)) / (probability of an email containing the word “lottery”).
So those are some examples of classification models. There are others. Modern programming frameworks include these machine learning techniques to make it easier for programmers and data scientists to apply them to analytics.
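Here is a small worked example of that calculation using Bayes' theorem, P(spam | word) = P(word | spam) * P(spam) / P(word). The mailbox counts are invented purely to have numbers to plug in, and the function name is hypothetical.

```python
def probability_spam_given_word(p_word_given_spam, p_spam, p_word):
    """Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)."""
    return p_word_given_spam * p_spam / p_word

# Hypothetical mailbox counts.
total_emails = 1000
spam_emails = 200
spam_with_lottery = 50   # spam emails that contain "lottery"
all_with_lottery = 55    # all emails, spam or not, that contain "lottery"

p_spam = spam_emails / total_emails
p_lottery_given_spam = spam_with_lottery / spam_emails
p_lottery = all_with_lottery / total_emails

p = probability_spam_given_word(p_lottery_given_spam, p_spam, p_lottery)
print(round(p, 2))  # 0.91
```

With these made-up counts, 50 of the 55 emails containing “lottery” are spam, so the result agrees with simply counting them directly.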
About me
Kirill Fuchs is a passionate developer at Fuzz Productions in Brooklyn, NY. He builds APIs and data-driven applications for clients such as CBS and Anheuser-Busch. Fuzz is a New York-based mobile app development company that specializes in designing and developing iOS, Android, and data-driven applications.
PS: Fuzz is hiring :)