7 Prediction Algorithms Explained - in English
I’m no expert and no Data Scientist, just a fan learning some basics and sharing what I’m learning. So if you’re an expert and want to correct me or challenge me in a debate over the theory of these items, I defer to you in advance!
The open source community’s ability to make knowledge, tools, and support available to everyone who wants to learn and implement practices like these is amazing to me. Just 5-10 years ago, knowledge like this would have been guarded in a corporate R&D facility, but thankfully the democratization of IT knowledge has reached a tipping point and we’re all the better for it.
As an economics student many years ago I was fascinated by how models could reflect and predict complex outcomes. This led me to explore the current state of predictive analytics and Data Science, and I was surprised to find it’s not that scary or hard. I chose the Programming Language R path and found that once you understand the basics, the bevy of applicable packages (freely shared) do most of the heavy lifting for you.
As a Sales Professional I was interested in learning how code can predict sales outcomes and guide sales efforts. My research led me to several different types of algorithms for making predictions, each with pros and cons, strengths and weaknesses, depending on your use case.
With increasing and predicting sales as my preferred use case, I’ll give you a layman’s explanation of several of my favorites. For the purposes of this post, I’ll explain each using the sample use case of the imaginary and ever-so-creative “Bob’s Fruit Stand.”
I assume, through extrapolation, you can deduce that Bob wants to sell more fruit. I’m going to offer the technical definition of each (don’t freak out), then explain it so even I can understand it.
1. K-Means Clustering
What to say to scare people: K-Means clustering is an unsupervised learning method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
BTW – R does all this math for you!
Bob’s Fruit Explanation: If Bob plots all of his apple and orange sales by day on a chart he’ll have a bunch of random dots. If he picks 2 random points on the plot, assigns every dot to whichever point is nearest, and then iterates toward the optimal “cluster centers,” he’ll get 2 clusters, each with minimum distance to its cluster epicenter. So, how does this sell more fruit?
If Bob’s daily sales are his n observations and his k = 2 clusters show that apples and oranges sell better on the 3rd Tuesday of each month, he can carry more inventory on those days to maximize profits, and carry less on slower days to reduce waste.
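Since R really does do the math for you, here’s roughly what that looks like: a minimal sketch using base R’s kmeans(), with made-up daily sales numbers standing in for Bob’s real data.

```r
# Made-up daily apple and orange sales (two loose groups on purpose)
set.seed(42)
sales <- data.frame(
  apples  = c(rnorm(30, mean = 20, sd = 3), rnorm(30, mean = 45, sd = 3)),
  oranges = c(rnorm(30, mean = 15, sd = 3), rnorm(30, mean = 40, sd = 3))
)

# kmeans() ships with base R -- partition the days into k = 2 clusters
fit <- kmeans(sales, centers = 2)

fit$centers        # the two cluster "epicenters"
table(fit$cluster) # how many days landed in each cluster
```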
- Pros - Fast, Efficient with large numbers of variables, Explainable
- Cons - You have to know k in advance, Outliers affect cluster centers and can affect outcomes negatively
- Primary Uses - Preliminary grouping of data for other algorithms, Geographic clustering
2. Naïve Bayes
What to say to scare people: Naïve Bayes classifiers are highly scalable probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Maximum-likelihood training can be done by evaluating a closed-form expression, which takes linear time, rather than by the expensive iterative approximation used for many other types of classifiers.
Again, R does all this math for you!
Bob’s Fruit Explanation: Bob wants to know the probability of a female customer buying apples, P(A), so he can decide who to offer volume deals to. He knows he had 100 customers last month, 30 of them bought apples, and the demographic of his town is 60% female. Gender and fruit purchased become the predictor events (B1 and B2). P(A|B1,B2) gives Bob the probability that his next female customer will buy apples.
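To see those probability outputs in action, here’s a minimal sketch using naiveBayes() from the e1071 package (one common implementation in R); the tiny customer table is invented for illustration.

```r
library(e1071)  # install.packages("e1071") if needed

# Made-up customer records: gender and whether they bought apples
customers <- data.frame(
  gender        = factor(c("F", "F", "M", "F", "M", "M", "F", "F", "M", "F")),
  bought_apples = factor(c("yes", "no", "no", "yes", "no",
                           "yes", "yes", "no", "no", "yes"))
)

# Train a naive Bayes classifier: apple purchases predicted from gender
model <- naiveBayes(bought_apples ~ gender, data = customers)

# Probability that the next female customer buys apples
predict(model,
        newdata = data.frame(gender = factor("F", levels = levels(customers$gender))),
        type = "raw")
```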
- Pros - Simple and fast, Works well with noisy or incomplete data, Provides probability outputs, Good with categorical data
- Cons - Lower accuracy, Expects predictors to be independent, Not good with large numeric sets
- Primary Uses - Medical diagnosis, Spam filtering, Document classification, Sporting event predictions
3. C4.5 Decision Tree
What to say to scare people: If C4.5 doesn’t do it alone, say it is an extension of Quinlan's ID3 statistical classifier information entropy algorithm, and walk away.
Did I mention R does all this math for you?
Bob’s Fruit Explanation: Bob wants to build a decision tree to predict which customers he should offer home delivery to. He only wants to deliver within 5 miles and only wants to offer the service to his biggest fruit buyers. So his decision tree (sketched in R after the list) would be:
- Does customer (on average) buy more than $100/Mo - Y/N
- If yes, does customer live in Zip Code 38483 - Y/N
- If yes, then they’re a good candidate for the service!
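Here’s a minimal sketch of that tree in R. Note the hedge: rpart() implements CART, a close cousin of C4.5 (an actual C4.5 implementation is RWeka’s J48(), which needs Java); the eight customers below are invented to match Bob’s rules.

```r
library(rpart)  # ships with R

# Invented customers: "yes" = spends > $100/mo AND lives in 38483
customers <- data.frame(
  monthly_spend  = c(120, 80, 150, 200, 40, 110, 95, 130),
  zip            = factor(c("38483", "38483", "38401", "38483",
                            "38483", "38401", "38483", "38483")),
  offer_delivery = factor(c("yes", "no", "no", "yes",
                            "no", "no", "no", "yes"))
)

# minsplit = 2 lets rpart split even this tiny toy data set
tree <- rpart(offer_delivery ~ monthly_spend + zip, data = customers,
              method = "class", control = rpart.control(minsplit = 2))
print(tree)  # the learned splits mirror the Y/N questions above
```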
- Pros - Easy to understand, Works with missing data, Sensitive to local variation, Fast
- Cons - Lower accuracy, Subject to bias, Not good with numerous predictors
- Primary Uses - Credit approvals, When you need to explain a decision, Categorization
4. Random Forest
What to say to scare people: Random forests are an ensemble learning method for classification and regression that operates by constructing a multitude of decision trees at training time, applying random feature selection and bagging to correct for individual trees’ habit of overfitting, and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
You know, R does all this math for you?
Bob’s Fruit Explanation: Bob’s going to need some help from a Data Scientist with this one. For this, his buddy the Data Scientist (we’ll call him Marvin) will take all of Bob’s data and run some correlation programs to see which variables appear to drive different results. He builds a random forest, and it indicates “when it’s over 90 degrees you sell 3 times the watermelon, and when it’s below 50 degrees you sell 3 times the pecans.” Now Bob can adjust inventory based on the weather forecast. The multiple decision trees in the forest check the data multiple ways to find the most accurate predictions.
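Here’s a rough idea of what Marvin might run, sketched with the randomForest package and invented weather/sales data (real work would of course use Bob’s actual records).

```r
library(randomForest)  # install.packages("randomForest") if needed

set.seed(7)
temp_f  <- runif(200, min = 30, max = 100)
weekend <- factor(sample(c("no", "yes"), 200, replace = TRUE))
# Pretend hot days sell ~3x the watermelon, plus some noise
watermelon <- ifelse(temp_f > 90, 30, 10) + rnorm(200, sd = 2)
days <- data.frame(temp_f, weekend, watermelon)

# Grow a forest of trees predicting watermelon sales
fit <- randomForest(watermelon ~ temp_f + weekend, data = days)

importance(fit)  # which predictors matter most

# Forecast says 95 degrees on a weekday -- how much watermelon to stock?
predict(fit, newdata = data.frame(
  temp_f = 95, weekend = factor("no", levels = levels(days$weekend))))
```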
- Pros - Very accurate, Works well with multiple predictors, Parallelizable, Good with missing data
- Cons - Time and resource intensive, Biased toward variables with more levels
- Primary Uses - Scientific research, Competition Outcome Predictions, Medical Diagnosis
5. Apriori / ARM - Market Basket
What to say to scare people: Apriori is an algorithm for frequent item set mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database. The frequent item sets determined by Apriori can be used to determine association rules which highlight general trends in the database: this has applications in domains such as market basket analysis.
You know, R does all this math for you!
Bob’s Fruit Explanation: This is the most popular type, as it’s what Amazon uses to predict that if you buy mason jars you might also like fruit-canning books. It’s also called Market Basket Analysis.
N is the number of transactions. X, Y and Z are bananas, kiwi and grapefruit.
Support measures how frequently a combination of items occurs.
- Support(X) = Count(transactions with X) / N
- Support(X,Y) = Count(transactions with X & Y) / N
Confidence measures the expected probability that a customer who buys bananas will also buy kiwi.
- Confidence(X->Y) = Support(X,Y) / Support(X)
Lift measures how many more times bananas and kiwi are bought together than would be expected if the two were independent.
- Lift(X->Y) = Confidence(X->Y) / Support(Y)
Now Bob knows that when a customer buys bananas but doesn’t pick up kiwi, he should suggest them, and odds are they will buy them.
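The arules package does all of this counting for you. A minimal sketch with five invented baskets; the support and confidence thresholds below are arbitrary choices for the toy data.

```r
library(arules)  # install.packages("arules") if needed

# Five made-up market baskets
baskets <- list(
  c("bananas", "kiwi"),
  c("bananas", "kiwi", "grapefruit"),
  c("bananas"),
  c("kiwi", "grapefruit"),
  c("bananas", "kiwi")
)
trans <- as(baskets, "transactions")

# Mine association rules above minimum support and confidence
rules <- apriori(trans, parameter = list(supp = 0.4, conf = 0.6))

inspect(sort(rules, by = "lift"))  # e.g. {bananas} => {kiwi}
```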
- Pros - Easy to understand, Many applications, Simple to apply
- Cons - Not good with missing data, Moderate accuracy
- Primary Uses - Shopping, Recommendations
There are 2 more, but Bob’s not ready for these. He got woozy when Marvin mentioned them, but for your sake:
6. Artificial Neural Network
Scary: Artificial neural networks (ANNs) are a family of models inspired by biological neural networks and are used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown. Artificial neural networks are generally presented as systems of interconnected "neurons" which exchange messages between each other. The connections have numeric weights that can be tuned based on experience, making neural nets adaptive to inputs and capable of learning.
Marvin: Inspired by how the brain works. Discovers complex correlations hidden in data. Works well with noisy and diverse data. Use cases include artificial intelligence and machine learning.
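For the curious, a minimal taste using the nnet package (a simple single-hidden-layer network that ships with R), run on R’s built-in iris data rather than anything of Bob’s:

```r
library(nnet)  # ships with R

set.seed(1)
# One hidden layer with 4 neurons, classifying iris species
fit <- nnet(Species ~ ., data = iris, size = 4, maxit = 200, trace = FALSE)

# Confusion table: predicted vs. actual species
table(predicted = predict(fit, iris, type = "class"), actual = iris$Species)
```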
7. Support Vector Machines
Scary: A support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training-data point of any class (so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier.
Marvin: Based on vector geometry and statistical learning theory. Used for highly complex relationships. Use cases include facial recognition and bioinformatics.
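And the same quick taste for SVMs, using svm() from the e1071 package (an interface to libsvm) on the built-in iris data:

```r
library(e1071)

# Fit an SVM that separates the three iris species with hyperplanes
fit <- svm(Species ~ ., data = iris)

table(predicted = predict(fit, iris), actual = iris$Species)
```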
*Citations - Wikipedia & V2 Maestros, Applied Data Science (Udemy)