Pros and Cons - Comparing different Machine learning algos
Krishna Yogi Kolluru
Data Architect | ML | Speaker | ex-Microsoft | ex-Credit Suisse | IIT - NUS Alumni | AWS & Databricks Certified Data Engineer | Author of 'Why Bitcoin'
Machine learning is a very exciting field, with changes almost on a daily basis, and one cannot help but get confused in this flood of information. Having said that, the basics of data science are as relevant today as they were 10 years ago (and 10 years is a century in data science :) ).
Let's jump right in.
Occam's Razor principle states that we should always start with the least complicated algorithm that can address our needs and move to more complex ones only as the need arises.
The simplest of all algorithms are linear regression and logistic regression.
At the mid-level come algorithms like Naive Bayes, SVM, kNN, and decision tree ensembles such as Random Forest and GBM.
The most complex ones are neural nets and deep neural nets like MLP, RNN, CNN, LSTM, GRU, reinforcement learning, and so on.
The irony with 'deep learning' algorithms is that most of them have reasonably understandable modeling structures but lack interpretability, and hence are called 'black box' models.
Let's get into the pros and cons of ML algorithms.
Logistic Regression
Logistic regression is probably the most widely used classification algorithm, as it is both easy to understand and works on a wide variety of data sets.
Pros
- Easy to interpret: the output can be read as a probability, unlike decision trees or SVMs.
- Can be easily updated to take in new data (using online gradient descent).
- Can be used for ranking instead of classification.
- Good for cases where features are expected to be roughly linear and the problem linearly separable.
- You can easily feature-engineer most non-linear features into linear ones.
- Robust to noise.
- Regularize your model with L2 or L1 regularization to avoid overfitting (and for feature selection).
- You don't have to worry as much about your features being correlated as you do in Naive Bayes.
- Efficient, and can be distributed (ADMM).
- Easy to adjust classification thresholds and compute confidence intervals (see the sketch after this list).
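To make a few of these points concrete, here is a minimal scikit-learn sketch (synthetic data and illustrative, untuned parameter values; not code from any particular project): probability output, a custom classification threshold, L1/L2 regularization, and online updates via stochastic gradient descent.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# L2-regularized logistic regression (C is the inverse regularization strength)
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
clf.fit(X_train, y_train)

# The output is a probability, which we can threshold however we like
proba = clf.predict_proba(X_test)[:, 1]
preds = (proba >= 0.3).astype(int)          # custom threshold instead of 0.5

# L1 penalty gives sparse coefficients, i.e. implicit feature selection
sparse_clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
sparse_clf.fit(X_train, y_train)
print("non-zero coefficients:", np.count_nonzero(sparse_clf.coef_))

# Online updates with new batches of data (logistic loss via SGD)
online_clf = SGDClassifier(loss="log_loss")  # "log" in older scikit-learn versions
online_clf.partial_fit(X_train, y_train, classes=np.array([0, 1]))
```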
Cons
- Cannot handle categorical (binary) variables well.
- Suffers from multicollinearity.
Lasso (L1)
- No distribution requirement.
- Performs variable selection.
- Suffers from multicollinearity.
Ridge (L2)
- No distribution requirement.
- No variable selection (contrast with Lasso in the sketch after this list).
- Does not suffer from multicollinearity.
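A tiny illustration of the difference on synthetic data (assumed, untuned alpha values): Lasso drives some coefficients exactly to zero (variable selection), while Ridge only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 10 features, only 3 of which actually matter
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))  # typically several
print("Ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))  # typically none
```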
Things to watch out for!
- If the variables are normally distributed and the categorical variables all have 5+ categories: use linear discriminant analysis.
- If the relationships are mostly non-linear: use SVM.
- If sparsity and multicollinearity are a concern: adaptive Lasso with Ridge (for the weights).
Naive Bayes
Pros
- Super simple, you’re just doing a bunch of counts.
- If the conditional independence assumption actually holds, a NB classifier will converge quicker than discriminative models like logistic regression, so you need less training data.
- Even if the NB assumption doesn't hold, a NB classifier still often does a great job in practice. A good bet if you want something fast and easy that performs pretty well (see the sketch after this section).
- Good for categorical variables with few categories.
Cons
- Suffers from multicollinearity.
- Its main disadvantage is that it can’t learn the interactions between features.
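A minimal sketch of the "bunch of counts" point, using a made-up toy spam example with scikit-learn's MultinomialNB on word-count features:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy documents and labels (1 = spam, 0 = not spam) -- purely illustrative
docs = ["free prize money now", "meeting agenda for monday",
        "win money free entry", "project status meeting notes"]
labels = [1, 0, 1, 0]

# Word counts in, class-conditional count statistics out
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["free money meeting"]))
```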
Support Vector Machines
Five years ago, SVMs used to be classified under 'complex algorithms'; not anymore :).
SVMs try to identify the margin between classes by drawing lines (or hyperplanes). These margins are maximized (which tends to improve generalization); the data points that lie on the margin are called 'support vectors', hence the name. They are also called maximum margin classifiers.
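A minimal sketch of this idea with scikit-learn on synthetic blobs: fit a linear SVM and inspect the support vectors that pin down the maximum-margin hyperplane.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated synthetic clusters
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# Only these points determine the decision boundary
print("support vectors per class:", clf.n_support_)
print("first few support vectors:\n", clf.support_vectors_[:3])
```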
Pros
- High accuracy, nice theoretical guarantees regarding overfitting, and with an appropriate kernel they can work well even if your data isn't linearly separable in the base feature space.
- Really effective in high dimensions, even when the number of features exceeds the number of training examples.
- Popular in text classification problems where very high-dimensional spaces are the norm.
- Works superbly when classes are separable
- The hyperplane is affected by only the support vectors thus outliers have less impact.
- Well suited to extreme-case binary classification.
- Uses a different loss function (hinge) than logistic regression, and is interpreted differently (maximum margin).
- SVM with a linear kernel is similar to logistic regression in practice.
- If the problem is not linearly separable, use a non-linear kernel (e.g. RBF). (Logistic regression can also be used with a different kernel.)
- Does not suffer from multicollinearity.
Cons
- Memory-intensive, hard to interpret, and kind of annoying to run and tune
- Not suited for most "industry scale" applications, as it doesn't scale well to large datasets.
- Does not perform well in case of overlapped classes.
- Selecting the appropriate kernel function can be tricky (one common way to search over kernels is sketched below).
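Since kernel choice is tricky, one common (illustrative) approach is a small cross-validated grid search over kernels and their hyper-parameters, sketched here on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Candidate kernels and (illustrative) hyper-parameter ranges
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
]
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```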
Decision Trees - Random Forests & GBM
Decision Trees
Pros
- Easy to interpret and explain
- They easily handle feature interactions and they're non-parametric, so you don't have to worry about outliers or whether the data is linearly separable (e.g., decision trees easily take care of cases where you have class A at the low end of some feature x, class B in the mid-range of feature x, and A again at the high end; see the toy sketch after this subsection).
- Handle high-dimensional spaces as well as large numbers of training examples very well.
- No distribution requirement.
- Heuristic: trees are grown greedily, one split at a time.
- Good for categorical variables with few categories.
- Do not suffer from multicollinearity.
Cons
- Don't support online learning, so you have to rebuild your tree when new examples come in.
- They easily overfit, but that's where ensemble methods like random forests (or boosted trees) come in.
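Here is a toy sketch (made-up data and thresholds) of the A/low, B/mid, A/high example from the pros list: a shallow decision tree captures this non-monotonic pattern with two splits and no feature engineering.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# One feature x; class 1 only in the mid-range, class 0 at both ends
x = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.where((x.ravel() > 3) & (x.ravel() < 7), 1, 0)

tree = DecisionTreeClassifier(max_depth=2).fit(x, y)
print(export_text(tree, feature_names=["x"]))  # two splits, roughly at 3 and 7
```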
Random Forests
- train each tree independently, using a random sample of the data, so the trained model is more robust than a single decision tree and far less likely to overfit
- 2 parameters: number of trees and number of features to be selected at each node (see the sketch after this list).
- good for parallel or distributed computing.
- lower classification error and better f-scores than decision trees.
- perform as well as or better than SVMs, but far easier for humans to understand.
- good with uneven data sets with missing variables.
- calculates feature importance
- train much faster than SVMs
- random forests are often the winner for lots of classification problems (usually slightly ahead of SVMs).
- They’re fast and scalable, and you don’t have to worry about tuning a bunch of parameters like you do with SVMs.
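A minimal sketch of the two main knobs mentioned above (number of trees and number of features tried per split), plus the built-in feature-importance scores; values are illustrative, not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)

rf = RandomForestClassifier(
    n_estimators=300,     # number of trees
    max_features="sqrt",  # features considered at each split
    n_jobs=-1,            # trees are independent, so training parallelizes well
    random_state=0,
)
rf.fit(X, y)
print(rf.feature_importances_.round(3)[:5])
```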
Gradient Boosted Decision Trees
- Builds trees one at a time; each new tree corrects some of the errors made by the previous trees, so the model becomes progressively more expressive.
- 3 parameters: number of trees, depth of trees, and learning rate; trees are generally shallow (see the sketch after this subsection).
- Usually performs better than Random Forests, but harder to get right: the hyper-parameters are harder to tune and it is more prone to overfitting. RFs can almost work "out of the box", unlike GBM.
- Training takes longer since trees are built sequentially.
Cons
- Compared to RF, GBM overfits more readily.
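A sketch of the three parameters called out above (number of trees, tree depth kept shallow, learning rate), with illustrative, untuned values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

gbm = GradientBoostingClassifier(
    n_estimators=200,    # trees are added sequentially
    max_depth=3,         # shallow trees
    learning_rate=0.05,  # smaller values need more trees but overfit less
    random_state=0,
)
print(round(cross_val_score(gbm, X, y, cv=5).mean(), 3))
```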
Linear discriminant analysis
- Requires normally distributed variables.
- Not good for categorical variables with few categories.
- Models each class with a multivariate normal distribution (a minimal sketch follows this list).
- Can compute confidence intervals.
- Suffers from multicollinearity.
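A minimal LDA sketch on synthetic data (illustrative only); note that it leans on the roughly-Gaussian-features assumption above and returns class posteriors directly.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print(lda.predict_proba(X[:3]).round(3))  # class posteriors for the first rows
```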
Neural Networks
Pros
- Good at modeling non-linear data with a large number of input features.
- Widely used in industry.
- Many open-source implementations.
Cons
- Only works with numerical inputs: fixed-length vectors and datasets with no missing values.
- "Black box-y": the classification boundaries are hard to understand intuitively (like trying to interrogate the human unconscious for the reasons behind our conscious actions).
- Computationally expensive.
- The trained model depends crucially on the initial parameters.
- Difficult to troubleshoot when they don't work as expected.
- Multi-layer neural networks are usually hard to train and require tuning lots of parameters (see the sketch after this list).
- Not probabilistic, unlike their more statistical or Bayesian counterparts. The continuous output (e.g. a score) can be difficult to translate into a probability.
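A minimal neural-network sketch with scikit-learn's MLPClassifier (synthetic data, illustrative hyper-parameters), just to show the kind of knobs that need tuning and the dependence on scaling and random initialization.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

mlp = make_pipeline(
    StandardScaler(),                           # NNs want scaled numeric inputs
    MLPClassifier(hidden_layer_sizes=(64, 32),  # layer sizes to tune
                  learning_rate_init=1e-3,      # learning rate to tune
                  max_iter=500,
                  random_state=0),              # results vary with initialization
)
mlp.fit(X, y)
print(round(mlp.score(X, y), 3))
```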
Summary
Questions to ask (a short benchmarking sketch follows this list):
- the number of training examples (how large is your training set?)
  - if small: high bias/low variance classifiers (e.g., Naive Bayes) are less likely to overfit
  - if large: low bias/high variance classifiers (e.g., kNN or logistic regression) start to have the advantage
- dimensionality of the feature space
- is the problem linearly separable?
- are features independent?
- are features expected to depend linearly on the target variable?
- is overfitting expected to be a problem?
- system requirement: speed, performance, memory usage
- Does it require variables to be normally distributed?
- Does it suffer multicollinearity issue?
- Does it work well with categorical variables as well as continuous variables?
- Does it calculate confidence intervals?
- Does it conduct variable selection without stepwise selection?
- Does it apply to sparse data?
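When these questions don't single out one algorithm, a quick cross-validated benchmark of a few candidates is often the most honest tie-breaker; the setup below is purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive bayes": GaussianNB(),
    "svm (rbf)": SVC(),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, clf in candidates.items():
    print(name, round(cross_val_score(clf, X, y, cv=5).mean(), 3))
```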
References: various Medium articles
https://christophm.github.io/interpretable-ml-book/logistic.html