Pros and Cons - Comparing different Machine learning algos
Krishna Yogi Kolluru
Data Architect | ML | Speaker | ex-Microsoft | ex-Credit Suisse | IIT - NUS Alumni | AWS & Databricks Certified Data Engineer | Author of 'Why Bitcoin'
Machine learning is a very exciting field, with changes almost on a daily basis, and one cannot help but get confused in this flood of information. Having said that, the basics of data science are as relevant today as they were 10 years ago (and 10 years is a century in data science :) ).
Let's jump right in.
Occam's Razor principle states that we should always start with the least complicated algorithm that can address our needs and move to more complex ones only as the need arises.
The simplest of all algorithms are linear regression and logistic regression.
At the mid-level come algorithms like Naive Bayes, SVM, kNN, and decision tree ensembles such as Random Forest and GBM.
The most complex ones are neural nets and deep neural nets like MLP, RNN, CNN, LSTM, GRU, reinforcement learning, and so on.
The irony with 'deep learning' algorithms is that most of them have reasonably understandable modeling structures but lack interpretability, and hence are called 'black box' models.
Let's get into the pros and cons of ML algorithms.
Logistic Regression
Logistic regression is probably the most widely used classification algorithm, as it is both easy to understand and works on a wide variety of data sets.
Pros
- Easy to interpret: the output can be read as a probability, unlike decision trees or SVMs.
- Can be easily updated to take in new data (using online gradient descent).
- Can be used for ranking instead of classification.
- Good for cases where features are expected to be roughly linear and the problem linearly separable.
- You can easily feature-engineer most non-linear features into linear ones.
- Robust to noise.
- Regularize your model with L2 or L1 regularization to avoid overfitting (and for feature selection).
- You don't have to worry as much about your features being correlated as you do in Naive Bayes.
- Efficient, and can be distributed (ADMM).
- Easy to adjust classification thresholds and compute confidence intervals (see the sketch after this list).
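To make a few of these points concrete, here is a minimal scikit-learn sketch (synthetic data and illustrative, untuned parameter values; not code from any particular project): probability output, a custom classification threshold, L1/L2 regularization, and online updates via stochastic gradient descent.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# L2-regularized logistic regression (C is the inverse regularization strength)
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
clf.fit(X_train, y_train)

# The output is a probability, which we can threshold however we like
proba = clf.predict_proba(X_test)[:, 1]
preds = (proba >= 0.3).astype(int)          # custom threshold instead of 0.5

# L1 penalty gives sparse coefficients, i.e. implicit feature selection
sparse_clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
sparse_clf.fit(X_train, y_train)
print("non-zero coefficients:", np.count_nonzero(sparse_clf.coef_))

# Online updates with new batches of data (logistic loss via SGD)
online_clf = SGDClassifier(loss="log_loss")  # "log" in older scikit-learn versions
online_clf.partial_fit(X_train, y_train, classes=np.array([0, 1]))
```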
Cons
- Cannot handle categorical (binary) variables well.
- Suffers from multicollinearity.
Lasso (L1)
- No distribution requirement.
- Performs variable selection.
- Suffers from multicollinearity.
Ridge (L2)
- No distribution requirement.
- No variable selection (contrast with Lasso in the sketch after this list).
- Does not suffer from multicollinearity.
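A tiny illustration of the difference on synthetic data (assumed, untuned alpha values): Lasso drives some coefficients exactly to zero (variable selection), while Ridge only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 10 features, only 3 of which actually matter
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))  # typically several
print("Ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))  # typically none
```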
Things to watch out for!
- If the variables are normally distributed and the categorical variables all have 5+ categories: use linear discriminant analysis.
- If the relationships are mostly non-linear: use SVM.
- If sparsity and multicollinearity are a concern: adaptive Lasso with Ridge (for the weights).
Naive Bayes
Pros
- Super simple, you’re just doing a bunch of counts.
- If the conditional independence assumption actually holds, a NB classifier will converge quicker than discriminative models like logistic regression, so you need less training data.
- Even if the NB assumption doesn't hold, a NB classifier still often does a great job in practice. A good bet if you want something fast and easy that performs pretty well (see the sketch after this section).
- Good for categorical variables with few categories.
Cons
- Suffers from multicollinearity.
- Its main disadvantage is that it can’t learn the interactions between features.
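A minimal sketch of the "bunch of counts" point, using a made-up toy spam example with scikit-learn's MultinomialNB on word-count features:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy documents and labels (1 = spam, 0 = not spam) -- purely illustrative
docs = ["free prize money now", "meeting agenda for monday",
        "win money free entry", "project status meeting notes"]
labels = [1, 0, 1, 0]

# Word counts in, class-conditional count statistics out
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["free money meeting"]))
```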
Support Vector Machines
Five years ago, SVMs used to be classified under 'complex algorithms'; not anymore :).
SVMs try to identify the margin between classes by drawing lines (or hyperplanes). These margins are maximized (which tends to improve generalization); the data points that lie on the margin are called 'support vectors', hence the name. They are also called maximum margin classifiers.
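A minimal sketch of this idea with scikit-learn on synthetic blobs: fit a linear SVM and inspect the support vectors that pin down the maximum-margin hyperplane.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated synthetic clusters
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# Only these points determine the decision boundary
print("support vectors per class:", clf.n_support_)
print("first few support vectors:\n", clf.support_vectors_[:3])
```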
Pros
- High accuracy, nice theoretical guarantees regarding overfitting, and with an appropriate kernel they can work well even if your data isn't linearly separable in the base feature space.
- Really effective in high dimensions, even when the number of features exceeds the number of training examples.
- Popular in text classification problems where very high-dimensional spaces are the norm.
- Works superbly when classes are separable
- The hyperplane is affected by only the support vectors thus outliers have less impact.
- Well suited to extreme-case binary classification.
- Uses a different loss function (hinge) than logistic regression, and is interpreted differently (maximum margin).
- SVM with a linear kernel is similar to logistic regression in practice.
- If the problem is not linearly separable, use a non-linear kernel (e.g. RBF). (Logistic regression can also be used with a different kernel.)
- Does not suffer from multicollinearity.
Cons
- Memory-intensive, hard to interpret, and kind of annoying to run and tune
- Not suited for most "industry scale" applications, as it doesn't scale well to large datasets.
- Does not perform well in case of overlapped classes.
- Selecting the appropriate kernel function can be tricky (one common way to search over kernels is sketched below).
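Since kernel choice is tricky, one common (illustrative) approach is a small cross-validated grid search over kernels and their hyper-parameters, sketched here on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Candidate kernels and (illustrative) hyper-parameter ranges
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
]
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```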
Decision Trees - Random Forests & GBM
Decision Trees
Pros
- Easy to interpret and explain
- They easily handle feature interactions and they're non-parametric, so you don't have to worry about outliers or whether the data is linearly separable (e.g., decision trees easily take care of cases where you have class A at the low end of some feature x, class B in the mid-range of feature x, and A again at the high end; see the toy sketch after this subsection).
- Handle high-dimensional spaces as well as large numbers of training examples very well.
- No distribution requirement.
- Heuristic: trees are grown greedily, one split at a time.
- Good for categorical variables with few categories.
- Do not suffer from multicollinearity.
Cons
- Don't support online learning, so you have to rebuild your tree when new examples come in.
- They easily overfit, but that's where ensemble methods like random forests (or boosted trees) come in.
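Here is a toy sketch (made-up data and thresholds) of the A/low, B/mid, A/high example from the pros list: a shallow decision tree captures this non-monotonic pattern with two splits and no feature engineering.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# One feature x; class 1 only in the mid-range, class 0 at both ends
x = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.where((x.ravel() > 3) & (x.ravel() < 7), 1, 0)

tree = DecisionTreeClassifier(max_depth=2).fit(x, y)
print(export_text(tree, feature_names=["x"]))  # two splits, roughly at 3 and 7
```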
Random Forests
- train each tree independently, using a random sample of the data, so the trained model is more robust than a single decision tree and far less likely to overfit
- 2 parameters: number of trees and number of features to be selected at each node (see the sketch after this list).
- good for parallel or distributed computing.
- lower classification error and better f-scores than decision trees.
- perform as well as or better than SVMs, but far easier for humans to understand.
- good with uneven data sets with missing variables.
- calculates feature importance
- train much faster than SVMs
- random forests are often the winner for lots of classification problems (usually slightly ahead of SVMs).
- They’re fast and scalable, and you don’t have to worry about tuning a bunch of parameters like you do with SVMs.
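A minimal sketch of the two main knobs mentioned above (number of trees and number of features tried per split), plus the built-in feature-importance scores; values are illustrative, not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)

rf = RandomForestClassifier(
    n_estimators=300,     # number of trees
    max_features="sqrt",  # features considered at each split
    n_jobs=-1,            # trees are independent, so training parallelizes well
    random_state=0,
)
rf.fit(X, y)
print(rf.feature_importances_.round(3)[:5])
```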
Gradient Boosted Decision Trees
- Builds trees one at a time; each new tree corrects some of the errors made by the previous trees, so the model becomes progressively more expressive.
- 3 parameters: number of trees, depth of trees, and learning rate; trees are generally shallow (see the sketch after this subsection).
- Usually performs better than Random Forests, but harder to get right: the hyper-parameters are harder to tune and it is more prone to overfitting. RFs can almost work "out of the box", unlike GBM.
- Training takes longer since trees are built sequentially.
Cons
- Compared to RF, GBM overfits more readily.
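A sketch of the three parameters called out above (number of trees, tree depth kept shallow, learning rate), with illustrative, untuned values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

gbm = GradientBoostingClassifier(
    n_estimators=200,    # trees are added sequentially
    max_depth=3,         # shallow trees
    learning_rate=0.05,  # smaller values need more trees but overfit less
    random_state=0,
)
print(round(cross_val_score(gbm, X, y, cv=5).mean(), 3))
```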
Linear discriminant analysis
- Requires normally distributed variables.
- Not good for categorical variables with few categories.
- Models each class with a multivariate normal distribution (a minimal sketch follows this list).
- Can compute confidence intervals.
- Suffers from multicollinearity.
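A minimal LDA sketch on synthetic data (illustrative only); note that it leans on the roughly-Gaussian-features assumption above and returns class posteriors directly.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print(lda.predict_proba(X[:3]).round(3))  # class posteriors for the first rows
```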
Neural Networks
Pros
- Good at modeling non-linear data with a large number of input features.
- Widely used in industry.
- Many open-source implementations.
Cons
- Only works with numerical inputs: fixed-length vectors and datasets with no missing values.
- "Black box-y": the classification boundaries are hard to understand intuitively (like trying to interrogate the human unconscious for the reasons behind our conscious actions).
- Computationally expensive.
- The trained model depends crucially on the initial parameters.
- Difficult to troubleshoot when they don't work as expected.
- Multi-layer neural networks are usually hard to train and require tuning lots of parameters (see the sketch after this list).
- Not probabilistic, unlike their more statistical or Bayesian counterparts. The continuous output (e.g. a score) can be difficult to translate into a probability.
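A minimal neural-network sketch with scikit-learn's MLPClassifier (synthetic data, illustrative hyper-parameters), just to show the kind of knobs that need tuning and the dependence on scaling and random initialization.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

mlp = make_pipeline(
    StandardScaler(),                           # NNs want scaled numeric inputs
    MLPClassifier(hidden_layer_sizes=(64, 32),  # layer sizes to tune
                  learning_rate_init=1e-3,      # learning rate to tune
                  max_iter=500,
                  random_state=0),              # results vary with initialization
)
mlp.fit(X, y)
print(round(mlp.score(X, y), 3))
```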
Summary
Questions to ask (a short benchmarking sketch follows this list):
- the number of training examples (how large is your training set?)
  - if small: high bias/low variance classifiers (e.g., Naive Bayes) are less likely to overfit
  - if large: low bias/high variance classifiers (e.g., kNN or logistic regression) start to have the advantage
- dimensionality of the feature space
- is the problem linearly separable?
- are features independent?
- are features expected to depend linearly on the target variable?
- is overfitting expected to be a problem?
- system requirement: speed, performance, memory usage
- Does it require variables to be normally distributed?
- Does it suffer multicollinearity issue?
- Does it work well with categorical variables as well as continuous variables?
- Does it calculate confidence intervals?
- Does it conduct variable selection without stepwise selection?
- Does it apply to sparse data?
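When these questions don't single out one algorithm, a quick cross-validated benchmark of a few candidates is often the most honest tie-breaker; the setup below is purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive bayes": GaussianNB(),
    "svm (rbf)": SVC(),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, clf in candidates.items():
    print(name, round(cross_val_score(clf, X, y, cv=5).mean(), 3))
```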
References: various Medium articles
https://christophm.github.io/interpretable-ml-book/logistic.html