Most suitable algorithm for classification – Decision tree or Logistic regression
Classification is one of the crucial problems we need to solve when working on business problems in data science. In this article, we will discuss solving it with two popular techniques – Decision trees and Logistic regression. People are often most concerned about which one to apply to a classification problem, and most of the time the answer is "it depends on the business context". Both techniques are widely used for predictive modeling cases such as churn prediction, survival, and response.
Today we will see the differences between the two, the applications of each, and the business contexts you should evaluate.
Decision trees are non-parametric while Logistic regression is a parametric method. In practice this means they fit the data in different ways and converge at different rates: a parametric model estimates a fixed set of coefficients and typically needs less data to reach its best achievable error, while a non-parametric model is more flexible but needs more data to converge. Decision trees are mostly used if you want to understand the decision rules that create your segments. They work best if you have lots of categorical variables, or even a small number of them with each having a significant number of levels. Logistic regression, on the other hand, works well on smooth numeric predictors. In logistic regression, all you need is the probability to predict a class.
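To make this concrete, here is a minimal sketch in Python (scikit-learn on synthetic data, purely for illustration) contrasting the two kinds of output: the fitted tree exposes readable decision rules you can turn into segments, while logistic regression exposes a probability per class.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: 500 rows, 4 numeric features, binary target.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)

# Decision tree: the learned if/else rules define the segments directly.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["f0", "f1", "f2", "f3"]))

# Logistic regression: the output is a probability for each class.
logit = LogisticRegression().fit(X, y)
print(logit.predict_proba(X[:3]))  # columns: P(class 0), P(class 1)
```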
Decision trees implicitly assume that the decision boundaries are parallel to the axes, so they chop the feature space up into rectangles. They naturally extend to very complex functions, which brings the risk of over-fitting. This is not the case with Logistic regression, which assumes a smooth linear decision boundary and estimates class probabilities by maximum likelihood under the logit model. Decision trees are likely to be a good fit where you have a small number of features and plenty of training data points; with many features the curse of dimensionality sets in, and a tree can lose a lot of information by splitting on only a reduced subset of them. Logistic regression is a better fit for high-dimensional data.
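A quick sketch of the over-fitting point, again on synthetic data: left unconstrained, the tree keeps refining its rectangles until it memorises the training set, while logistic regression is held to one smooth boundary.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy two-class data with a non-linear boundary.
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # no depth limit
logit = LogisticRegression().fit(X_tr, y_tr)

print("tree  train/test:", tree.score(X_tr, y_tr), tree.score(X_te, y_te))
print("logit train/test:", logit.score(X_tr, y_tr), logit.score(X_te, y_te))
# Typically the tree scores ~1.0 on train but noticeably lower on test
# (over-fit), while logistic regression scores similarly on both.
```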
Here are some of the advantages of each technique over the other:
Decision Trees
- They have very intuitive decision rules
- They are less sensitive to the outliers in the data
- Decision trees can handle non-linear functions
- They work very well with variable interactions in the data
Logistic Regression
- Logistic regression can handle multi-collinearity very well through L2 regularization (see the sketch after this list)
- It gives convenient probability values for the observations
- There is widespread industry comfort with it for classification problems, and efficient implementations are available in many tools
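As a sketch of the L2 regularization point above (on made-up data, with one column exactly duplicated to force perfect collinearity): an unpenalised fit would have no unique solution for the duplicated columns, while the ridge penalty simply spreads the weight across them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_col = np.hstack([X, X[:, [0]]])  # column 3 is an exact copy of column 0

# L2 (ridge) penalty is the default in scikit-learn's LogisticRegression.
logit = LogisticRegression(penalty="l2", C=1.0).fit(X_col, y)
print(logit.coef_)  # weight is shared between the duplicated columns
```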
Both algorithms are very fast in terms of execution time. Logistic regression works better if you have a single decision boundary that separates the classes, whereas decision trees are efficient if you need multiple boundaries. And although Logistic regression is still the most prevalent algorithm for industry problems at scale, it is losing ground to other techniques with more efficient implementations.
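To illustrate the boundary point, here is a small sketch on a synthetic XOR-style pattern, where no single linear boundary exists:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Label flips across both axes, so the classes form four quadrants.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

logit = LogisticRegression().fit(X_tr, y_tr)
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("logit test accuracy:", logit.score(X_te, y_te))  # near 0.5 (chance)
print("tree  test accuracy:", tree.score(X_te, y_te))   # close to 1.0
```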
Both the techniques discussed are very effective in their own ways and help you classify your responses appropriately. Which one is most suitable is, in the end, your choice.