What is the best algorithm for classification problem?

Classification is one of the core data mining tasks, applied in many areas, especially in retail, banking, infrastructure, and medical applications. One reason for using this technique is to make smart decisions, and for that we need to select the appropriate algorithm for each data set. This leads to the questions "What is the best algorithm for a classification problem?" and "How do we select it?"

First of all, I assume you already know this, but it is always important to remember: there is no single algorithm that is better than all the others on all problems. This is discussed by several authors and is often called the "no free lunch" theorem. The right classification algorithm depends heavily on your data set.

There are many classification techniques, and each offers different benefits. You have to know the main characteristics of these techniques, analyse your problem in depth, and choose the appropriate technique based on that analysis; be clear about your goals and the nature of the problem. The most commonly used machine learning algorithms for classification are as follows:

  1. Artificial Neural Networks
  2. Gradient Boosting
  3. Random Forests
  4. Decision Trees
  5. Naïve Bayes Classifier Algorithm
  6. Support Vector Machine Algorithm
  7. Logistic Regression
  8. K Nearest Neighbor
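
Since no single algorithm wins on every problem, a practical way to choose is to benchmark a few candidates on your own data with cross-validation. Below is a minimal sketch, assuming scikit-learn is available (the article names no library); the synthetic data set from make_classification is just a stand-in for your real feature matrix and labels.

```python
# Compare several candidate classifiers on the same data set with
# 5-fold cross-validation, then pick the strongest performer.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Stand-in for your own data set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

candidates = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold CV accuracy
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Whichever model scores best on your data, under your evaluation metric, is "the best algorithm" for your problem.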

Among all of these, Decision Tree, Random Forest, and Gradient Boosting are my favourites, though the other algorithms have their own pros and cons too. Now let's discuss these three algorithms in detail.

Decision Trees: A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.

There are different types of decision tree algorithms like:

  • ID3 (Iterative Dichotomiser 3)
  • C4.5 (successor of ID3)
  • CART (Classification and Regression Tree)
  • CHAID (CHI-squared Automatic Interaction Detector)
  • MARS (extends decision trees to better handle numerical data)

A) Strengths of the model:

  • Are simple to understand and interpret. People are able to understand decision tree models after a brief explanation.
  • Have value even with little hard data. Important insights can be generated based on experts describing a situation (its alternatives, probabilities, and costs) and their preferences for outcomes.
  • Help determine worst, best and expected values for different scenarios.
  • Use a white box model. If a given result is provided by the model, the explanation for the result is easily replicated by simple Boolean logic.
  • Can be combined with other decision techniques.

B) Weaknesses of the model:

  • They are unstable, meaning that a small change in the data can lead to a large change in the structure of the optimal decision tree.
  • They are often relatively inaccurate. Many other predictors perform better with similar data. This can be remedied by replacing a single decision tree with a random forest of decision trees, but a random forest is not as easy to interpret as a single decision tree.
  • For data including categorical variables with different numbers of levels, information gain in decision trees is biased in favor of attributes with more levels.
  • Calculations can get very complex, particularly if many values are uncertain and/or if many outcomes are linked.
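
To make the white-box point concrete, here is a minimal sketch of fitting and inspecting a single decision tree. It assumes scikit-learn (whose implementation is CART-based) and uses the bundled iris data set as a stand-in for your own data.

```python
# Fit a small decision tree and print its learned rules.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# max_depth limits tree size, which counters the instability and
# overfitting weaknesses noted above.
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print(f"Test accuracy: {tree.score(X_test, y_test):.3f}")

# White-box interpretability: the learned rules can be read directly.
print(export_text(tree, feature_names=load_iris().feature_names))
```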


Gradient Boosting: Gradient boosted trees (GBTs) are built one at a time, where each new tree helps to correct the errors made by the previously trained trees.

A) Strengths of the model

Since boosted trees are derived by optimizing an objective function, a GBM can be used to solve almost any objective for which we can write out a gradient. This includes things like ranking and Poisson regression, which are harder to achieve with RF.

B) Weaknesses of the model

  • GBMs are more sensitive to overfitting if the data is noisy.
  • Training generally takes longer because the trees are built sequentially.
  • GBMs are harder to tune than RFs. There are typically three parameters: the number of trees, the depth of the trees, and the learning rate; each tree built is generally shallow. A tuning sketch follows this list.
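
Here is that tuning sketch: a minimal scikit-learn example (an assumption, as the article names no library) that grid-searches exactly those three parameters over a synthetic stand-in data set.

```python
# Tune the three key GBM parameters: number of trees, tree depth,
# and learning rate, using a cross-validated grid search.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

param_grid = {
    "n_estimators": [100, 300],    # number of trees
    "max_depth": [2, 3],           # each tree is kept shallow
    "learning_rate": [0.05, 0.1],  # shrinks each tree's contribution
}

search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, f"CV accuracy: {search.best_score_:.3f}")
```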

Random Forest: RFs train each tree independently, using a random sample of the data. This randomness helps to make the model more robust than a single decision tree and less likely to overfit the training data.

A) Strengths of the model

  • RFs are much easier to tune than GBMs. There are typically two parameters in RF: the number of trees and the number of features to be selected at each node (see the sketch after the weaknesses below).
  • RFs are harder to overfit than GBMs.

B) Weaknesses of the model

  • The main limitation of the Random Forests algorithm is that a large number of trees may make the algorithm slow for real-time prediction.
  • For data including categorical variables with different numbers of levels, random forests are biased in favor of attributes with more levels. Therefore, the variable importance scores from a random forest are not reliable for this type of data. Methods such as partial permutations have been used to address the problem.
  • If the data contain groups of correlated features of similar relevance for the output, then smaller groups are favored over larger groups.
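
As a counterpart to the GBM sketch above, here is a minimal random forest example (again assuming scikit-learn) that exposes the two parameters highlighted earlier: the number of trees and the number of features tried at each node.

```python
# Train a random forest; the two main knobs are n_estimators and
# max_features, matching the tuning story above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

rf = RandomForestClassifier(
    n_estimators=200,     # number of trees; many trees can slow real-time prediction
    max_features="sqrt",  # features considered at each split
    random_state=42,
)
scores = cross_val_score(rf, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f}")
```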

Through this article, I would also like to thank each and every one of you who has read, liked, clapped, and commented on my articles. This is the sole motivation that encourages me to keep writing articles.

Keep reading and I’ll keep writing.

References:

https://en.wikipedia.org/wiki/Gradient_boosting

https://www.jair.org/index.php/jair/article/view/10127

https://en.wikipedia.org/wiki/Decision_tree

https://en.wikipedia.org/wiki/Random_forest

https://medium.com/greyatom/decision-trees-a-simple-way-to-visualize-a-decision-dc506a403aeb
