BxD Primer Series: Decision Trees for Classification
Hey there!
Welcome to the BxD Primer Series, where we cover topics such as machine learning models, neural nets, GPT, ensemble models, and hyper-automation in a 'one-post-one-topic' format. Today's post is on Decision Trees for Classification. Let's get started:
The What:
In the previous edition, on Decision Trees for Regression (check here), we explained the process of building decision trees and noted that they are primarily meant for classification problems. This edition builds on that one, so please read it first if you haven't.
The goal in classification is to predict the class or category of a given input. For example, given the information in an email, a classification algorithm might be trained to predict whether it is spam or a normal email.
Decision trees (a supervised machine learning algorithm) can be used for classification problems. The basic idea behind a decision tree is to recursively split the input data into subsets based on the values of different features, until the subsets are pure (i.e., all of the examples in a subset belong to the same class) or a stopping criterion is met. Each split is based on a decision rule learned from the training data.
To make a prediction with a decision tree, traverse the tree from the root to a leaf node, following the path determined by the answers to the question at each node. The final prediction is the class associated with that leaf node.
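To make this concrete, below is a minimal sketch of training and querying a decision tree classifier, assuming scikit-learn is available; the Iris dataset and the parameter values are purely illustrative.

```python
# Minimal sketch: fit a decision tree classifier and predict by tree traversal.
# Assumes scikit-learn is installed; dataset and parameters are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=42
)

# criterion="gini" is the default; "entropy" is also supported (see split criteria below)
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# Each prediction follows the root-to-leaf path of feature/threshold questions
print("predicted classes:", clf.predict(X_test[:5]))

# Text view of the learned decision rules
print(export_text(clf, feature_names=list(data.feature_names)))
```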
The Why:
Decision Trees have several advantages over other types of models, such as:
Use Cases:
Decision Trees are used in various real-world applications across industries, including:
These are just a few examples of the many real-world applications of Decision Trees. The versatility and interpretability of Decision Trees make them a popular and effective tool for a wide range of machine learning problems.
Node Split Criteria:
Gini impurity, entropy, and the chi-squared test are the three most commonly used node split criteria for classification problems.
Gini Impurity measures the probability of incorrectly classifying a randomly chosen sample from a dataset. It is 0 for a perfectly pure subset (all samples belong to one class) and reaches its maximum of 1 - 1/C for C classes (0.5 in the binary case) when the classes are evenly mixed.
To find the best split, we evaluate the impurity of each potential split and select the one with the lowest impurity. The Gini impurity of a subset S, and the weighted Gini impurity of a potential split of a node into subsets S1 and S2, are calculated as:

Gini(S) = 1 - Σ_k (p_k)^2

Gini_split = (n1/n)*Gini(S1) + (n2/n)*Gini(S2)

Where,

p_k is the proportion of samples in S that belong to class k, n1 and n2 are the number of samples in S1 and S2, and n = n1 + n2 is the number of samples at the node being split.

A potential split can be at any of the available features and feature values. The Gini impurity is calculated for each of those potential splits, and whichever split has the lowest weighted impurity value becomes the node decision.
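As a rough illustration of this calculation, here is a small NumPy sketch that scores candidate thresholds of one numeric feature by their weighted Gini impurity; the toy arrays and function names are made up for this example.

```python
import numpy as np

def gini(labels):
    """Gini impurity of one subset: 1 - sum of squared class proportions."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_of_split(feature_values, labels, threshold):
    """Weighted Gini impurity of splitting on: feature <= threshold."""
    left = labels[feature_values <= threshold]
    right = labels[feature_values > threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Toy data: the threshold with the lowest weighted impurity becomes the node decision
x = np.array([0.5, 1.0, 2.0, 3.5, 4.2, 5.1])
y = np.array([0, 0, 0, 1, 1, 1])
for t in x[:-1]:
    print(f"split at x <= {t}: weighted Gini = {gini_of_split(x, y, t):.3f}")
```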
Entropy is a measure of randomness in a set of samples. To evaluate the quality of a potential split at a node, first calculate the entropy of the node as below:

H(S) = - Σ_k p_k * log2(p_k)

Where,

p_k is the proportion of samples in node S that belong to class k.
To perform a split on a feature (f) at value (v), the parent node is partitioned into two subsets S1 and S2. The information gain from the split is calculated as:

IG(S, f, v) = H(S) - [ (|S1|/|S|)*H(S1) + (|S2|/|S|)*H(S2) ]

Where,

H(S1) and H(S2) are the entropies of the two subsets, and |S1|, |S2|, |S| are the number of samples in S1, S2 and the parent node respectively.

The feature and feature value that result in the highest information gain are chosen as the node decision. This process is repeated recursively for each subset until a stopping criterion is met.
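Continuing the same toy setup, here is a short sketch of entropy and information gain; again the data and names are illustrative.

```python
import numpy as np

def entropy(labels):
    """Entropy H(S) = -sum(p_k * log2(p_k)) over the classes present in S."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(feature_values, labels, threshold):
    """Information gain of splitting the parent node on: feature <= threshold."""
    left = labels[feature_values <= threshold]
    right = labels[feature_values > threshold]
    n = len(labels)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - children

x = np.array([0.5, 1.0, 2.0, 3.5, 4.2, 5.1])
y = np.array([0, 0, 1, 1, 1, 0])
best = max(x[:-1], key=lambda t: information_gain(x, y, t))
print("best threshold:", best, "gain:", round(information_gain(x, y, best), 3))
```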
Chi-Square Test is used to determine whether two categorical variables are independent. It is performed on a contingency table constructed from the feature values and the target variable. The contingency table is a two-dimensional table that displays the frequency distribution of the samples across the different values of the feature and the target variable.
The Chi-Square Test statistic is calculated as:

chi^2 = Σ_i Σ_j (O_i,j - E_i,j)^2 / E_i,j

Where,

O_i,j is the observed frequency in the cell at row i (feature value) and column j (target class) of the contingency table, and E_i,j is the expected frequency for that cell under the assumption of independence.

The expected frequency E_i,j for each cell in the contingency table is calculated as:

E_i,j = (total of row i) * (total of column j) / (total number of samples)
The null hypothesis is that the feature and the target variable are independent. If the Chi-Square Test statistic is large enough, then the null hypothesis is rejected, and it is concluded that the feature and the target variable are dependent.
The degrees of freedom for the Chi-Square Test are calculated as:
df = (r - 1)*(c - 1)
Where r is the number of rows (distinct feature values) and c is the number of columns (target classes) in the contingency table.
The Chi-Square Test statistic is then compared to the critical value of the Chi-Square distribution with df degrees of freedom at a given significance level. If the Chi-Square Test statistic is greater than the critical value, then the null hypothesis is rejected, and the feature is considered for the split in the decision tree.
Note: The critical value can be looked up in a table of the Chi-Square distribution.
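The whole test can also be run programmatically; below is a sketch assuming SciPy is available, with a purely illustrative contingency table.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Contingency table: rows = feature values, columns = target classes (toy numbers)
observed = np.array([[30, 10],
                     [15, 25],
                     [ 5, 15]])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)
critical_value = chi2.ppf(0.95, dof)  # critical value at the 5% significance level

print(f"chi2 = {chi2_stat:.2f}, dof = {dof}, critical value = {critical_value:.2f}")
if chi2_stat > critical_value:
    print("Reject independence: the feature is a candidate for the split")
else:
    print("Cannot reject independence: the feature is a weak split candidate")
```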
Hyper-parameters in Decision Tree:
The hyper-parameters of a Decision Tree act as stopping criteria. Instead of specifying a single hard stopping rule, it is good practice to tune them for the data and problem at hand. Multiple hyper-parameters may be available depending on the implementation you are using; they fall into the four categories below:
Hyper-parameter Tuning Techniques:
Common methods for tuning the hyper-parameters of a Decision Tree model are:
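Whichever tuning method you choose, the pattern is similar; below is a grid-search sketch assuming scikit-learn, where the parameter grid values are illustrative rather than recommendations (random or Bayesian search would follow the same structure).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hyper-parameters that act as stopping criteria for tree growth
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 10, 50],
    "min_samples_leaf": [1, 5, 20],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="f1",  # see the note below on classification evaluation metrics
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
print(round(search.best_score_, 3))
```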
Note: We will do a separate edition explaining the model evaluation techniques used for classification problems. For self-exploration, you can start with the confusion matrix, accuracy, recall, F1 score, etc.
Additionally, feature importance in decision trees is an interesting topic that we will leave for self-study. Start with techniques such as Gini Importance, Mean Decrease Impurity, Permutation Importance, etc.
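As a starting point for that self-study, here is a short sketch of two of those techniques, assuming scikit-learn: impurity-based (Gini / Mean Decrease Impurity) importance and permutation importance. The dataset and parameters are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_train, y_train)

# Impurity-based importance (Gini importance / Mean Decrease Impurity),
# computed from how much each feature reduces impurity across the tree
print("impurity-based:", clf.feature_importances_.round(3)[:5])

# Permutation importance: drop in held-out score when a feature is shuffled
result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=42)
print("permutation:", result.importances_mean.round(3)[:5])
```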
The Why Not:
Although decision trees are an appropriate algorithm for most classification tasks, there are a few reasons not to use them and to consider other alternatives:
Time for you to help in return:
In the coming posts, we will cover one more type of decision tree model, Conditional Inference Trees, in a similar format. After that, we will move on to other classification models such as Support Vector Machines, Naive Bayes, and K-Nearest Neighbours.
Let us know your feedback!
Until then,
Enjoy life in full!
Founding Partner - BUSINESS x DATA (Implementing AI-Driven Personalization at Scale)
If you prefer email updates, visit here: https://anothermayank.substack.com #substack