7 Prediction Algorithms Explained - in English
I’m no expert and no Data Scientist, just a fan learning some basics and sharing what I’m learning. So if you’re an expert and want to correct me or challenge me in a debate over the theory of these items, I defer to you in advance!
The open source community’s ability to make knowledge, tools, and support available to everyone who wants to learn and implement practices like these is amazing to me. Just 5-10 years ago, knowledge like this would have been guarded in a corporate R&D facility, but thankfully the democratization of IT knowledge has reached a tipping point and we’re all the better for it.
As an economics student many years ago I was fascinated by how models could reflect and predict complex outcomes. This led me to explore the current state of predictive analytics and Data Science, and I was surprised to find it’s not that scary or hard. I chose the Programming Language R path and found that once you understand the basics, the bevy of applicable packages (freely shared) do most of the heavy lifting for you.
As a Sales Professional I was interested in learning how code can predict sales outcomes and guide sales efforts. My research led me to several different types of algorithms for making predictions, each with pros and cons, strengths and weaknesses, depending on your use case.
With increasing and predicting sales as my preferred use case, I’ll give you a layman’s explanation of several of my favorites. For the purposes of this post, I’ll explain each using the sample use case of the imaginary and ever-so-creative “Bob’s Fruit Stand.”
I assume, through extrapolation, you can deduce that Bob wants to sell more fruit. I’m going to offer the technical definition of each (don’t freak out), then explain it so even I can understand it.
1. K-Means Clustering
What to say to scare people: K-Means clustering is an unsupervised learning method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
BTW – R does all this math for you!
Bob’s Fruit Explanation: If Bob plots all of his apple and orange sales by day on a chart he’ll have a bunch of random dots. If he picks 2 random points on the plot, assigns every dot to whichever point is nearest, and then iterates toward the optimal “cluster centers,” he’ll get 2 clusters, each with minimum distance to its cluster epicenter. So, how does this sell more fruit?
If Bob’s daily sales are his n observations and his k = 2 clusters show that apples and oranges sell better on the 3rd Tuesday of each month, he can carry more inventory on those days to maximize profits, and carry less on slower days to reduce waste.
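Since R really does do the math for you, here’s roughly what that looks like: a minimal sketch using base R’s kmeans(), with made-up daily sales numbers standing in for Bob’s real data.

```r
# Made-up daily apple and orange sales (two loose groups on purpose)
set.seed(42)
sales <- data.frame(
  apples  = c(rnorm(30, mean = 20, sd = 3), rnorm(30, mean = 45, sd = 3)),
  oranges = c(rnorm(30, mean = 15, sd = 3), rnorm(30, mean = 40, sd = 3))
)

# kmeans() ships with base R -- partition the days into k = 2 clusters
fit <- kmeans(sales, centers = 2)

fit$centers        # the two cluster "epicenters"
table(fit$cluster) # how many days landed in each cluster
```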
- Pros - Fast, Efficient with large numbers of variables, Explainable
- Cons - You have to know k in advance, Outliers affect cluster centers and can affect outcomes negatively
- Primary Uses - Preliminary grouping of data for other algorithms, Geographic clustering
2. Naïve Bayes
What to say to scare people: Naïve Bayes classifiers are highly scalable probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Maximum-likelihood training can be done by evaluating a closed-form expression, which takes linear time, rather than by the expensive iterative approximation used for many other types of classifiers.
Again, R does all this math for you!
Bob’s Fruit Explanation: Bob wants to know the probability of a female customer buying apples, P(A), so he can decide who to offer volume deals to. He knows he had 100 customers last month, 30 of them bought apples, and the demographic of his town is 60% female. Gender and fruit purchased become the predictor events (B1 and B2). P(A|B1,B2) gives Bob the probability that his next female customer will buy apples.
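To see those probability outputs in action, here’s a minimal sketch using naiveBayes() from the e1071 package (one common implementation in R); the tiny customer table is invented for illustration.

```r
library(e1071)  # install.packages("e1071") if needed

# Made-up customer records: gender and whether they bought apples
customers <- data.frame(
  gender        = factor(c("F", "F", "M", "F", "M", "M", "F", "F", "M", "F")),
  bought_apples = factor(c("yes", "no", "no", "yes", "no",
                           "yes", "yes", "no", "no", "yes"))
)

# Train a naive Bayes classifier: apple purchases predicted from gender
model <- naiveBayes(bought_apples ~ gender, data = customers)

# Probability that the next female customer buys apples
predict(model,
        newdata = data.frame(gender = factor("F", levels = levels(customers$gender))),
        type = "raw")
```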
- Pros - Simple and fast, Works well with noisy or incomplete data, Provides probability outputs, Good with categorical data
- Cons - Lower accuracy, Expects predictors to be independent, Not good with large numeric sets
- Primary Uses - Medical diagnosis, Spam filtering, Document classification, Sporting event predictions
3. C4.5 Decision Tree
What to say to scare people: If C4.5 doesn’t do it alone, say it is an extension of Quinlan's ID3 statistical classifier information entropy algorithm, and walk away.
Did I mention R does all this math for you?
Bob’s Fruit Explanation: Bob wants to build a decision tree to predict which customers he should offer home delivery to. He only wants to deliver within 5 miles and only wants to offer the service to his biggest fruit buyers. So his decision tree (sketched in R after the list) would be:
- Does customer (on average) buy more than $100/Mo - Y/N
- If yes, does customer live in Zip Code 38483 - Y/N
- If yes, then they’re a good candidate for the service!
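Here’s a minimal sketch of that tree in R. Note the hedge: rpart() implements CART, a close cousin of C4.5 (an actual C4.5 implementation is RWeka’s J48(), which needs Java); the eight customers below are invented to match Bob’s rules.

```r
library(rpart)  # ships with R

# Invented customers: "yes" = spends > $100/mo AND lives in 38483
customers <- data.frame(
  monthly_spend  = c(120, 80, 150, 200, 40, 110, 95, 130),
  zip            = factor(c("38483", "38483", "38401", "38483",
                            "38483", "38401", "38483", "38483")),
  offer_delivery = factor(c("yes", "no", "no", "yes",
                            "no", "no", "no", "yes"))
)

# minsplit = 2 lets rpart split even this tiny toy data set
tree <- rpart(offer_delivery ~ monthly_spend + zip, data = customers,
              method = "class", control = rpart.control(minsplit = 2))
print(tree)  # the learned splits mirror the Y/N questions above
```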
- Pros - Easy to understand, Works with missing data, Sensitive to local variation, Fast
- Cons - Lower accuracy, Subject to bias, Not good with numerous predictors
- Primary Uses - Credit approvals, When you need to explain a decision, Categorization
4. Random Forest
What to say to scare people: Random forests are an ensemble learning method for classification and regression that operates by constructing a multitude of decision trees at training time, applying random feature selection and bagging to correct for individual trees’ habit of overfitting, and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
You know, R does all this math for you?
Bob’s Fruit Explanation: Bob’s going to need some help from a Data Scientist with this one. For this, his buddy the Data Scientist (we’ll call him Marvin) will take all of Bob’s data and run some correlation programs to see which variables appear to drive different results. He builds a random forest, and it indicates “when it’s over 90 degrees you sell 3 times the watermelon, and when it’s below 50 degrees you sell 3 times the pecans.” Now Bob can adjust inventory based on the weather forecast. The multiple decision trees in the forest check the data multiple ways to find the most accurate predictions.
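Here’s a rough idea of what Marvin might run, sketched with the randomForest package and invented weather/sales data (real work would of course use Bob’s actual records).

```r
library(randomForest)  # install.packages("randomForest") if needed

set.seed(7)
temp_f  <- runif(200, min = 30, max = 100)
weekend <- factor(sample(c("no", "yes"), 200, replace = TRUE))
# Pretend hot days sell ~3x the watermelon, plus some noise
watermelon <- ifelse(temp_f > 90, 30, 10) + rnorm(200, sd = 2)
days <- data.frame(temp_f, weekend, watermelon)

# Grow a forest of trees predicting watermelon sales
fit <- randomForest(watermelon ~ temp_f + weekend, data = days)

importance(fit)  # which predictors matter most

# Forecast says 95 degrees on a weekday -- how much watermelon to stock?
predict(fit, newdata = data.frame(
  temp_f = 95, weekend = factor("no", levels = levels(days$weekend))))
```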
- Pros - Very accurate, Works well with multiple predictors, Parallelizable, Good with missing data
- Cons - Time and resource intensive, Biased toward variables with more levels
- Primary Uses - Scientific research, Competition Outcome Predictions, Medical Diagnosis
5. Apriori / ARM - Market Basket
What to say to scare people: Apriori is an algorithm for frequent item set mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database. The frequent item sets determined by Apriori can be used to determine association rules which highlight general trends in the database: this has applications in domains such as market basket analysis.
You know, R does all this math for you!
Bob’s Fruit Explanation: This is the most popular type, as it’s what Amazon uses to predict that if you buy mason jars you might also like fruit-canning books. It’s also called Market Basket Analysis.
N is the number of transactions. X, Y and Z are bananas, kiwi and grapefruit.
Support measures how frequently a combination of items occurs.
- Support(X) = Count(transactions with X) / N
- Support(X,Y) = Count(transactions with X & Y) / N
Confidence measures the expected probability that a customer who buys bananas will also buy kiwi.
- Confidence(X->Y) = Support(X,Y) / Support(X)
Lift measures how many more times bananas and kiwi are bought together than would be expected if the two were independent.
- Lift(X->Y) = Confidence(X->Y) / Support(Y)
Now Bob knows that when a customer buys bananas but doesn’t pick up kiwi, he should suggest them, and odds are they will buy them.
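The arules package does all of this counting for you. A minimal sketch with five invented baskets; the support and confidence thresholds below are arbitrary choices for the toy data.

```r
library(arules)  # install.packages("arules") if needed

# Five made-up market baskets
baskets <- list(
  c("bananas", "kiwi"),
  c("bananas", "kiwi", "grapefruit"),
  c("bananas"),
  c("kiwi", "grapefruit"),
  c("bananas", "kiwi")
)
trans <- as(baskets, "transactions")

# Mine association rules above minimum support and confidence
rules <- apriori(trans, parameter = list(supp = 0.4, conf = 0.6))

inspect(sort(rules, by = "lift"))  # e.g. {bananas} => {kiwi}
```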
- Pros - Easy to understand, Many applications, Simple to apply
- Cons - Not good with missing data, Moderate accuracy
- Primary Uses - Shopping, Recommendations
There are 2 more, but Bob’s not ready for these. He got woozy when Marvin mentioned them, but for your sake:
6. Artificial Neural Network
Scary: Artificial neural networks (ANNs) are a family of models inspired by biological neural networks and are used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown. Artificial neural networks are generally presented as systems of interconnected "neurons" which exchange messages between each other. The connections have numeric weights that can be tuned based on experience, making neural nets adaptive to inputs and capable of learning.
Marvin: Inspired by how the brain works. Discovers complex correlations hidden in data. Works well with noisy and diverse data. Use cases include artificial intelligence and machine learning.
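For the curious, a minimal taste using the nnet package (a simple single-hidden-layer network that ships with R), run on R’s built-in iris data rather than anything of Bob’s:

```r
library(nnet)  # ships with R

set.seed(1)
# One hidden layer with 4 neurons, classifying iris species
fit <- nnet(Species ~ ., data = iris, size = 4, maxit = 200, trace = FALSE)

# Confusion table: predicted vs. actual species
table(predicted = predict(fit, iris, type = "class"), actual = iris$Species)
```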
7. Support Vector Machines
Scary: A support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training-data point of any class (so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier.
Marvin: Based on vector geometry and statistical learning theory. Used for highly complex relationships. Use cases include facial recognition and bioinformatics.
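And the same quick taste for SVMs, using svm() from the e1071 package (an interface to libsvm) on the built-in iris data:

```r
library(e1071)

# Fit an SVM that separates the three iris species with hyperplanes
fit <- svm(Species ~ ., data = iris)

table(predicted = predict(fit, iris), actual = iris$Species)
```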
*Citations - Wikipedia & V2 Maestros, Applied Data Science (Udemy)