BxD Primer Series: Decision Trees for Classification

Hey there!

Welcome to the BxD Primer Series, where we cover topics such as machine learning models, neural nets, GPT, ensemble models, and hyper-automation in a ‘one-post-one-topic’ format. Today’s post is on Decision Trees for Classification. Let’s get started:

The What:

In the previous edition, on Decision Trees for Regression (check here), we explained the process of building decision trees and noted that they were originally designed for classification problems. This edition builds on that one, so please read it first if you haven’t.

The goal in classification is to predict the class or category of a given input. For example, given the information in an email, a classification algorithm might be trained to predict whether it is spam or a normal email.

Decision trees (a supervised machine learning algorithm) can be used for classification problems. The basic idea behind a decision tree is to recursively split the input data into subsets based on the values of different features, until the subsets are pure (i.e., all of the examples in a subset belong to the same class) or a stopping criterion is met. Each split is based on a decision rule that is learned from the training data.

To make a prediction using a decision tree, traverse the tree from the root to a leaf node, following the path determined by the answers to the questions at each node. The final prediction is the class associated with that leaf node.
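
For a concrete feel, here is a minimal sketch using scikit-learn’s DecisionTreeClassifier on a tiny made-up spam-filtering dataset (the feature names and values are ours, purely for illustration):

```python
# Minimal sketch: a decision tree classifier with scikit-learn.
# The tiny dataset below is made up purely for illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: [num_links, has_word_free (0/1), sender_known (0/1)]
X = [
    [10, 1, 0],
    [ 0, 0, 1],
    [ 7, 1, 0],
    [ 1, 0, 1],
    [ 5, 1, 1],
    [ 0, 0, 1],
]
y = ["spam", "normal", "spam", "normal", "spam", "normal"]

# Fit a shallow tree; criterion="gini" is the default split criterion.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X, y)

# Prediction traverses the tree from the root to a leaf.
print(clf.predict([[8, 1, 0]]))   # expected: ['spam']

# Print the learned decision rules.
print(export_text(clf, feature_names=["num_links", "has_word_free", "sender_known"]))
```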

The Why:

Decision Trees have several advantages over other types of models, such as:

  1. Interpretable: Decision Trees are easy to understand and interpret, making them a popular choice for many applications.
  2. Handles Non-linear Relationships: Decision Trees can effectively capture non-linear relationships between features and the target.
  3. Handles Both Numerical and Categorical Data: Decision Trees can work with both numerical and categorical features, with little or no pre-processing in many implementations.
  4. Handles Missing Values: Many decision tree implementations can handle missing values in the data without extra pre-processing.
  5. Scalable: Decision Trees can scale well with large datasets, making them a good choice for many applications.
  6. Non-parametric: Decision trees do not make any assumptions about the distribution of the data. This can make them more flexible and robust than parametric models such as logistic regression.

Use Cases:

Decision Trees are used in various real-world applications across industries, including:

  1. Healthcare: Decision Trees are used in healthcare to identify medical conditions and determine the best treatment plan based on patient symptoms and medical history.
  2. Finance: Decision Trees are used in finance to analyze credit risk and to identify fraud in credit card transactions.
  3. Marketing: Decision Trees are used in marketing to segment customers, identify target audiences, and personalize promotions based on customer behavior.
  4. Manufacturing: Decision Trees are used in manufacturing to identify the causes of quality issues, optimize production processes, and predict equipment failures.
  5. Agriculture: Decision Trees are used in agriculture to identify the optimal crop planting conditions based on soil type, climate, and other environmental factors.
  6. Customer Service: Decision Trees are used in customer service to route customer inquiries to the appropriate department or agent, based on the nature of the inquiry.
  7. Human Resources: Decision Trees are used in human resources to screen job applicants based on their qualifications and experience.
  8. Environmental Science: Decision Trees are used in environmental science to identify the factors that contribute to pollution, and to predict the impact of different interventions.

These are just a few examples of the many real-world applications of Decision Trees. The versatility and interpretability of Decision Trees make them a popular and effective tool for a wide range of machine learning problems.

Node Split Criteria:

Gini impurity, entropy, and the chi-square test are the three most commonly used node split criteria for classification problems.

Gini Impurity measures the probability of incorrectly classifying a randomly chosen sample from a node if it were labeled according to the class distribution in that node. It is 0 for a pure node (all samples belong to one class) and reaches its maximum of 1 - 1/C (0.5 for binary classification) when samples are spread evenly across the C classes.

To find the best split, we evaluate the impurity of each potential split and select the one with the lowest impurity. The formula for calculating the Gini impurity for a potential split is:

Gini_split = Σ_{i=1..m} (n_i / n) * [1 - Σ_{c=1..C} (S_i,c / n_i)²]

Where,

  • m is the number of child nodes resulting from the split, two in most implementations
  • n is the total number of samples in the parent node
  • n_i is the number of samples in the i’th child node
  • C is the number of classes, two for binary classification
  • S_i,c is the number of samples in the i’th child node that belong to the c’th class

A potential split can be made on any of the available features and feature values. The Gini impurity is calculated for each potential split, and the split with the lowest impurity value becomes the node decision.
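
As a sketch of this calculation, the helper below computes the weighted Gini impurity of one candidate split from per-child class counts (the function name and example numbers are ours, not from the post):

```python
# Sketch: weighted Gini impurity of a candidate split, following the formula above.
def gini_of_split(child_class_counts):
    """child_class_counts holds one list of per-class counts per child node,
    e.g. [[8, 2], [1, 9]] for a two-way split with two classes."""
    n = sum(sum(counts) for counts in child_class_counts)    # samples in the parent node
    impurity = 0.0
    for counts in child_class_counts:                         # loop over the m child nodes
        n_i = sum(counts)
        if n_i == 0:
            continue
        gini_i = 1.0 - sum((s_ic / n_i) ** 2 for s_ic in counts)
        impurity += (n_i / n) * gini_i                         # weight by child size
    return impurity

print(gini_of_split([[8, 2], [1, 9]]))   # ~0.25, classes mostly separated
print(gini_of_split([[5, 5], [5, 5]]))   # 0.5, no separation at all
```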

Entropy is a measure of randomness in a set of samples. To evaluate the quality of a potential split at a node, first calculate the entropy of the node as below:

Entropy(node) = - Σ_{c=1..C} (n_c / n) * log2(n_c / n)

Where,

  • C is the number of classes, two for binary classification
  • n is the total number of samples in the parent node
  • n_c is the number of samples in node that belong to the c’th class

To perform a split on a feature (f) at value (v), the parent node is partitioned into two subsets S1 and S2. The information gain from the split is calculated as:

Information Gain(f, v) = Entropy(parent) - [(n_S1 / n) * Entropy(S1) + (n_S2 / n) * Entropy(S2)]

Where,

  • n_S1 is the number of samples in subset S1
  • n_S2 is the number of samples in subset S2

The feature and feature value that result in the highest information gain are chosen as the node decision. This process is repeated recursively for each subset until a stopping criterion is met.
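
A minimal sketch of these two formulas in Python (the function names and example counts are ours):

```python
import math

# Sketch: entropy of a node and information gain of a split, following the formulas above.
def entropy(class_counts):
    """class_counts: per-class sample counts in a node, e.g. [10, 10]."""
    n = sum(class_counts)
    ent = 0.0
    for n_c in class_counts:
        if n_c > 0:
            p = n_c / n
            ent -= p * math.log2(p)
    return ent

def information_gain(parent_counts, s1_counts, s2_counts):
    """Gain from splitting the parent node into subsets S1 and S2."""
    n = sum(parent_counts)
    n_s1, n_s2 = sum(s1_counts), sum(s2_counts)
    weighted_child_entropy = (n_s1 / n) * entropy(s1_counts) + (n_s2 / n) * entropy(s2_counts)
    return entropy(parent_counts) - weighted_child_entropy

# A balanced parent split into two fairly pure children gives a large gain:
print(information_gain([10, 10], [9, 1], [1, 9]))   # ~0.53
```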

Chi-Square Test is used to determine whether two categorical variables are independent. It is performed on a contingency table constructed from the feature values and the target variable. A contingency table is a two-dimensional table that displays the frequency distribution of samples across the different values of the feature and the target variable.

The Chi-Square Test statistic is calculated as:

χ² = Σ_{i=1..r} Σ_{j=1..c} (O_i,j - E_i,j)² / E_i,j

Where,

  • r is the number of rows in the contingency table
  • c is the number of columns in the contingency table
  • O_i,j is the observed frequency of samples in row i and column j
  • E_i,j is the expected frequency of samples in row i and column j

The expected frequency E_i,j for each cell in the contingency table is calculated as:

E_i,j = (total of row i * total of column j) / n, where n is the total number of samples in the contingency table

The null hypothesis is that the feature and the target variable are independent. If the Chi-Square Test statistic is large enough, then the null hypothesis is rejected, and it is concluded that the feature and the target variable are dependent.

The degree of freedom for the Chi-Square Test is calculated as:

df = (r - 1)*(c - 1)

The Chi-Square Test statistic is then compared to the critical value of the Chi-Square distribution with?df?degrees of freedom at a given significance level. If the Chi-Square Test statistic is greater than the critical value, then the null hypothesis is rejected, and the feature is considered for the split in the decision tree.

Note: The critical value can be looked up in a table of the Chi-Square distribution.
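
As a sketch, the whole procedure is available in SciPy: chi2_contingency returns the test statistic, p-value, degrees of freedom, and expected frequencies for an observed contingency table (the counts below are made up for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency, chi2 as chi2_dist

# Sketch: chi-square test on a contingency table of one categorical feature vs. the target.
# Rows = feature values, columns = target classes; counts are made up for illustration.
observed = np.array([
    [30, 10],   # feature value A
    [ 5, 35],   # feature value B
])

# correction=False skips Yates' continuity correction so the statistic matches the formula above.
chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2_stat, dof, p_value)

# Equivalent decision against the critical value at a 5% significance level:
critical = chi2_dist.ppf(0.95, dof)
print(chi2_stat > critical)   # True => reject independence, consider the feature for the split
```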

Hyper-parameters in Decision Trees:

The hyper-parameters of a Decision Tree act as stopping criteria. Instead of specifying a hard stopping rule, it is good practice to tune them for the data and problem at hand. The hyper-parameters available depend on the implementation you are using, but they fall into the four categories below (see the sketch after this list):

  1. Maximum depth of tree
  2. Minimum number of samples required in leaf node
  3. Maximum number of features that can be considered for each split
  4. Maximum number of leaf or total nodes in tree
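
For reference, these categories map onto parameters of scikit-learn’s DecisionTreeClassifier roughly as in the sketch below (the values shown are arbitrary):

```python
from sklearn.tree import DecisionTreeClassifier

# Sketch: the four hyper-parameter categories above, with arbitrary example values.
clf = DecisionTreeClassifier(
    max_depth=5,           # 1. maximum depth of the tree
    min_samples_leaf=10,   # 2. minimum number of samples required in a leaf node
    max_features="sqrt",   # 3. maximum number of features considered for each split
    max_leaf_nodes=20,     # 4. maximum number of leaf nodes in the tree
    random_state=0,
)
# clf.fit(X_train, y_train) as usual once the data is prepared.
```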

Hyper-parameter Tuning Techniques:

Common methods for tuning the hyper-parameters of a Decision Tree model are:

  1. Grid Search: This method involves specifying a grid of possible values for each hyper-parameter and then?exhaustively training models for all possible combinations?of hyper-parameters to find the best set of hyper-parameters that optimize a performance metric, such as accuracy or F1 score.
  2. Random Search: This method involves specifying a distribution for each hyper-parameter and then?randomly sampling?combinations from these distributions, training a model on each sampled combination to find the best set of hyper-parameters (see the sketch after this list).
  3. Bayesian Optimization: This method involves constructing a?probabilistic model?of the performance of the model as a function of the hyper-parameters, and then using this model to?guide the search?for the best set of hyper-parameters.
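
A minimal sketch of the first two methods with scikit-learn’s GridSearchCV and RandomizedSearchCV (the dataset, grid, and distributions are arbitrary choices; Bayesian optimization usually relies on a separate library such as Optuna or scikit-optimize, so it is not shown here):

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Grid search: exhaustively train a model for every combination in the grid.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, 10], "min_samples_leaf": [1, 5, 10]},
    scoring="f1_macro",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)

# Random search: sample a fixed number of combinations from the given distributions.
rand = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions={"max_depth": randint(2, 15), "min_samples_leaf": randint(1, 20)},
    n_iter=20,
    scoring="f1_macro",
    cv=5,
    random_state=0,
)
rand.fit(X, y)
print(rand.best_params_)
```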

Note: We will do a separate edition explaining the model evaluation techniques used for classification problems. For self-exploration, start with the confusion matrix, accuracy, recall, F1 measure, etc.

Additionally, feature importance in decision trees is an interesting topic that we will leave for self-study. Start with techniques such as Gini Importance (Mean Decrease in Impurity), Permutation Importance, etc.
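
As a starting point for that self-study, here is a minimal sketch of two of these views in scikit-learn (the dataset and parameters are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

# Sketch: two common feature-importance views for a fitted tree.
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Gini importance (mean decrease in impurity), built into the fitted tree:
print(clf.feature_importances_)

# Permutation importance: drop in score when each feature is shuffled.
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)
```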

The Why Not:

Although decision trees are an appropriate algorithm for most classification tasks, there are a few reasons not to use them and to consider other alternatives:

  1. Overfitting: Decision Trees can overfit on the training data if the tree is too deep or if the data is noisy. This can result in poor generalization on new data. One alternative is to use regularization techniques like pruning, where the tree is trimmed back to reduce complexity. Another alternative is to use ensemble methods like Random Forest or Gradient Boosted Trees, which create multiple decision trees and combine their results to improve performance and reduce overfitting (a pruning and ensemble sketch follows this list).
  2. Instability: Decision Trees can be unstable, meaning that small changes in the data can result in significantly different trees. The alternative is to use bagging or boosting techniques to randomly sample or weight the training data, which can make the resulting model more robust to changes in the data.
  3. Bias Towards Few Features: Decision Trees can be biased towards features with many levels or high cardinality. One alternative is to use feature selection or feature engineering techniques to reduce the number of levels or create new features that capture the same information in a simpler way. Another alternative is to use models that are less affected by high-cardinality features, such as linear models or support vector machines.
  4. Axis-parallel Decision Boundaries: Decision Trees split on one feature at a time, so their decision boundaries are built from axis-parallel segments. Such boundaries may not capture oblique or smooth class boundaries well. The alternative is to use models that can create more flexible, non-linear decision boundaries, such as neural networks or support vector machines with kernel functions.
  5. Greedy: Decision Trees are greedy algorithms, meaning that they make the locally best decision at each node without considering the overall best tree. An alternative is to use reinforcement learning techniques, where the tree is trained to maximize a reward function rather than to minimize a loss function.
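
A minimal sketch of the pruning and ensemble alternatives from point 1, using scikit-learn (the dataset and parameter values are arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Cost-complexity pruning: a larger ccp_alpha trims the tree back and curbs overfitting.
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
print(cross_val_score(pruned_tree, X, y, cv=5).mean())

# Ensemble alternative: average many decorrelated trees to reduce variance.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())
```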

Time for you to help in return:

  1. Reply to this edition with your question
  2. Forward/Share to a friend who can benefit from this
  3. Chat on Substack with BxD (here)
  4. Engage with BxD on LinkedIn (here)

In upcoming posts, we will cover one more type of decision tree model, Conditional Inference Trees, in a similar format. After that, we will move to other classification models such as Support Vector Machines, Naive Bayes, and K-Nearest Neighbours.

Let us know your feedback!

Until then,

Enjoy life in full!

#businessxdata #bxd #classification #decisiontrees #primer

Mayank K.

Founding Partner - BUSINESS x DATA (Implementing AI-Driven Personalization at Scale)


If you prefer email updates, visit here: https://anothermayank.substack.com #substack
