Most suitable algorithm for classification – Decision tree or Logistic regression
Classification is one of the crucial problems we need to solve when working on business problems in data science. In this article, we will discuss solving it with two popular techniques – Decision trees and Logistic regression. People are often most concerned about which one to apply to a classification problem, and most of the time the answer is "it depends on the business context". Both techniques are widely used for predictive modeling cases such as churn prediction, survival, and response.
Today we will see the differences between the two, the applications of each, and the business contexts you should evaluate.
Decision trees are non-parametric while Logistic regression is a parametric method. In practice this means they fit the data in different ways and converge at different rates: a parametric model estimates a fixed set of coefficients and typically needs less data to reach its best achievable error, while a non-parametric model is more flexible but needs more data to converge. Decision trees are mostly used if you want to understand the decision rules that create your segments. They work best if you have lots of categorical variables, or even a small number of them with each having a significant number of levels. Logistic regression, on the other hand, works well on smooth numeric predictors. In logistic regression, all you need is the probability to predict a class.
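To make this concrete, here is a minimal sketch in Python (scikit-learn on synthetic data, purely for illustration) contrasting the two kinds of output: the fitted tree exposes readable decision rules you can turn into segments, while logistic regression exposes a probability per class.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: 500 rows, 4 numeric features, binary target.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)

# Decision tree: the learned if/else rules define the segments directly.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["f0", "f1", "f2", "f3"]))

# Logistic regression: the output is a probability for each class.
logit = LogisticRegression().fit(X, y)
print(logit.predict_proba(X[:3]))  # columns: P(class 0), P(class 1)
```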
Decision trees implicitly assume that the decision boundaries are parallel to the axes, so they chop the feature space up into rectangles. They naturally extend to very complex functions, which brings the risk of over-fitting. This is not the case with Logistic regression, which assumes a smooth linear decision boundary and estimates class probabilities by maximum likelihood under the logit model. Decision trees are likely to be a good fit where you have a small number of features and plenty of training data points; with many features the curse of dimensionality sets in, and a tree can lose a lot of information by splitting on only a reduced subset of them. Logistic regression is a better fit for high-dimensional data.
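A quick sketch of the over-fitting point, again on synthetic data: left unconstrained, the tree keeps refining its rectangles until it memorises the training set, while logistic regression is held to one smooth boundary.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy two-class data with a non-linear boundary.
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # no depth limit
logit = LogisticRegression().fit(X_tr, y_tr)

print("tree  train/test:", tree.score(X_tr, y_tr), tree.score(X_te, y_te))
print("logit train/test:", logit.score(X_tr, y_tr), logit.score(X_te, y_te))
# Typically the tree scores ~1.0 on train but noticeably lower on test
# (over-fit), while logistic regression scores similarly on both.
```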
Here are some of the advantages of each technique over the other:
Decision Trees
- They have very intuitive decision rules
- They are less sensitive to the outliers in the data
- Decision trees can handle non-linear functions
- They work very well with variable interactions in the data
Logistic Regression
- Logistic regression can handle multi-collinearity very well through L2 regularization (see the sketch after this list)
- It gives convenient probability values for the observations
- There is widespread industry comfort with it for classification problems, and efficient implementations are available in many tools
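As a sketch of the L2 regularization point above (on made-up data, with one column exactly duplicated to force perfect collinearity): an unpenalised fit would have no unique solution for the duplicated columns, while the ridge penalty simply spreads the weight across them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_col = np.hstack([X, X[:, [0]]])  # column 3 is an exact copy of column 0

# L2 (ridge) penalty is the default in scikit-learn's LogisticRegression.
logit = LogisticRegression(penalty="l2", C=1.0).fit(X_col, y)
print(logit.coef_)  # weight is shared between the duplicated columns
```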
Both algorithms are very fast in terms of execution time. Logistic regression works better if you have a single decision boundary that separates the classes, whereas decision trees are efficient if you need multiple boundaries. And although Logistic regression is still the most prevalent algorithm for industry problems at scale, it is losing ground to other techniques with more efficient implementations.
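To illustrate the boundary point, here is a small sketch on a synthetic XOR-style pattern, where no single linear boundary exists:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Label flips across both axes, so the classes form four quadrants.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

logit = LogisticRegression().fit(X_tr, y_tr)
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("logit test accuracy:", logit.score(X_te, y_te))  # near 0.5 (chance)
print("tree  test accuracy:", tree.score(X_te, y_te))   # close to 1.0
```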
Both the techniques discussed are very effective in their own ways and help you classify your responses appropriately. Which one is most suitable is, in the end, your choice.