What is the best algorithm for classification problem?

Classification is one of the core data mining tasks, applied in many areas, especially in retail, banking, infrastructure, and medical applications. One reason for using this technique is to make smart decisions, and for that we need to select the appropriate algorithm for each data set. This leads to the questions "What is the best algorithm for a classification problem?" and "How do we select it?"

First of all, I assume you already know this, but it is always important to remember: there is no single algorithm that is better than all the others on all problems. This is discussed by several authors and is often called the "no free lunch" theorem. The right classification algorithm depends heavily on your data set.

There are many classification techniques, and each offers different benefits. You have to know the main characteristics of these techniques, analyse your problem in depth, and choose the appropriate technique based on that analysis; be clear about your goals and the nature of the problem. The most commonly used machine learning algorithms for classification are as follows:

  1. Artificial Neural Networks
  2. Gradient Boosting
  3. Random Forests
  4. Decision Trees
  5. Naïve Bayes Classifier Algorithm
  6. Support Vector Machine Algorithm
  7. Logistic Regression
  8. K Nearest Neighbor
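
Since no single algorithm wins on every problem, a practical way to choose is to benchmark a few candidates on your own data with cross-validation. Below is a minimal sketch, assuming scikit-learn is available (the article names no library); the synthetic data set from make_classification is just a stand-in for your real feature matrix and labels.

```python
# Compare several candidate classifiers on the same data set with
# 5-fold cross-validation, then pick the strongest performer.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Stand-in for your own data set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

candidates = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold CV accuracy
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Whichever model scores best on your data, under your evaluation metric, is "the best algorithm" for your problem.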

Among all of these, Decision Tree, Random Forest, and Gradient Boosting are my favourites, though the other algorithms have their own pros and cons too. Now let's discuss these three algorithms in detail.

Decision Trees: A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.

There are different types of decision tree algorithms like:

  • ID3 (Iterative Dichotomiser 3)
  • C4.5 (successor of ID3)
  • CART (Classification and Regression Tree)
  • CHAID (CHI-squared Automatic Interaction Detector)
  • MARS (extends decision trees to better handle numerical data)

A) Strengths of the model:

  • Are simple to understand and interpret. People are able to understand decision tree models after a brief explanation.
  • Have value even with little hard data. Important insights can be generated based on experts describing a situation (its alternatives, probabilities, and costs) and their preferences for outcomes.
  • Help determine worst, best and expected values for different scenarios.
  • Use a white box model. If a given result is provided by the model, the explanation for the result is easily replicated by simple Boolean logic.
  • Can be combined with other decision techniques.

B) Weaknesses of the model:

  • They are unstable, meaning that a small change in the data can lead to a large change in the structure of the optimal decision tree.
  • They are often relatively inaccurate. Many other predictors perform better with similar data. This can be remedied by replacing a single decision tree with a random forest of decision trees, but a random forest is not as easy to interpret as a single decision tree.
  • For data including categorical variables with different numbers of levels, information gain in decision trees is biased in favor of attributes with more levels.
  • Calculations can get very complex, particularly if many values are uncertain and/or if many outcomes are linked.
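
To make the white-box point concrete, here is a minimal sketch of fitting and inspecting a single decision tree. It assumes scikit-learn (whose implementation is CART-based) and uses the bundled iris data set as a stand-in for your own data.

```python
# Fit a small decision tree and print its learned rules.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# max_depth limits tree size, which counters the instability and
# overfitting weaknesses noted above.
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print(f"Test accuracy: {tree.score(X_test, y_test):.3f}")

# White-box interpretability: the learned rules can be read directly.
print(export_text(tree, feature_names=load_iris().feature_names))
```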


Gradient Boosting: Gradient boosted trees (GBTs) are built one at a time, where each new tree helps to correct the errors made by the previously trained trees.

A) Strengths of the model

Since boosted trees are derived by optimizing an objective function, a GBM can be used to solve almost any objective for which we can write out a gradient. This includes things like ranking and Poisson regression, which are harder to achieve with RF.

B) Weaknesses of the model

  • GBMs are more sensitive to overfitting if the data is noisy.
  • Training generally takes longer because the trees are built sequentially.
  • GBMs are harder to tune than RFs. There are typically three parameters: the number of trees, the depth of the trees, and the learning rate; each tree built is generally shallow. A tuning sketch follows this list.
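
Here is that tuning sketch: a minimal scikit-learn example (an assumption, as the article names no library) that grid-searches exactly those three parameters over a synthetic stand-in data set.

```python
# Tune the three key GBM parameters: number of trees, tree depth,
# and learning rate, using a cross-validated grid search.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

param_grid = {
    "n_estimators": [100, 300],    # number of trees
    "max_depth": [2, 3],           # each tree is kept shallow
    "learning_rate": [0.05, 0.1],  # shrinks each tree's contribution
}

search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, f"CV accuracy: {search.best_score_:.3f}")
```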

Random Forest: RFs train each tree independently, using a random sample of the data. This randomness helps to make the model more robust than a single decision tree and less likely to overfit the training data.

A) Strengths of the model

  • RFs are much easier to tune than GBMs. There are typically two parameters in RF: the number of trees and the number of features to be selected at each node (see the sketch after the weaknesses below).
  • RFs are harder to overfit than GBMs.

B) Weaknesses of the model

  • The main limitation of the Random Forests algorithm is that a large number of trees may make the algorithm slow for real-time prediction.
  • For data including categorical variables with different numbers of levels, random forests are biased in favor of attributes with more levels. Therefore, the variable importance scores from a random forest are not reliable for this type of data. Methods such as partial permutations have been used to address the problem.
  • If the data contain groups of correlated features of similar relevance for the output, then smaller groups are favored over larger groups.
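
As a counterpart to the GBM sketch above, here is a minimal random forest example (again assuming scikit-learn) that exposes the two parameters highlighted earlier: the number of trees and the number of features tried at each node.

```python
# Train a random forest; the two main knobs are n_estimators and
# max_features, matching the tuning story above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

rf = RandomForestClassifier(
    n_estimators=200,     # number of trees; many trees can slow real-time prediction
    max_features="sqrt",  # features considered at each split
    random_state=42,
)
scores = cross_val_score(rf, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f}")
```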

Through this article, I would also like to thank each and every one of you who has read, liked, clapped, and commented on my articles. This is the sole motivation that encourages me to keep writing articles.

Keep reading and I’ll keep writing.

References:

https://en.wikipedia.org/wiki/Gradient_boosting

https://www.jair.org/index.php/jair/article/view/10127

https://en.wikipedia.org/wiki/Decision_tree

https://en.wikipedia.org/wiki/Random_forest

https://medium.com/greyatom/decision-trees-a-simple-way-to-visualize-a-decision-dc506a403aeb
