Is it wise to use the CART technique when the dependent variable is skewed towards one of the classes?
Sarveshwaran Rajagopal
Data Scientist and Trainer (Gen AI & NLP) | Empowered 7000+ Professionals & Students to Excel in Data Science | Speaker and Thought Leader in Data Science
CART is a non-parametric technique, so it can be used on skewed data. Before applying CART to a skewed data set, it is recommended to follow the CRISP-DM methodology: analyse the data set and understand the business problem at hand. After interpreting the data, the class imbalance needs to be treated. Techniques for handling a skewed data set include undersampling, oversampling, synthetic data generation, and cost-sensitive learning.
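As an illustration of one of the techniques listed above, random oversampling can be sketched with the Python standard library alone. The function name `random_oversample`, its parameters, and the toy data are hypothetical and chosen only for this example. It duplicates randomly chosen minority rows until both classes are the same size.

```python
import random

def random_oversample(rows, labels, minority_label, seed=0):
    """Duplicate random minority rows until both classes have equal size.

    Hypothetical helper for illustration; assumes a binary problem.
    """
    rng = random.Random(seed)
    pairs = list(zip(rows, labels))
    minority = [p for p in pairs if p[1] == minority_label]
    majority = [p for p in pairs if p[1] != minority_label]
    if not minority:
        raise ValueError("no rows with the minority label")
    # Number of extra minority copies needed to match the majority class
    # (zero if the classes are already balanced).
    shortfall = max(0, len(majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(shortfall)]
    combined = majority + minority + extra
    rng.shuffle(combined)
    new_rows = [r for r, _ in combined]
    new_labels = [l for _, l in combined]
    return new_rows, new_labels

# Toy data: 1,000 rows, of which 21 belong to the minority class (label 1).
rows = list(range(1000))
labels = [1 if i < 21 else 0 for i in rows]
balanced_rows, balanced_labels = random_oversample(rows, labels, minority_label=1)
print(balanced_labels.count(0), balanced_labels.count(1))
```

Resampling of this kind is normally applied to the training split only, so that the test set keeps the original class distribution.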
After treating the data set, we use the CART model for classification. To prevent overfitting and handle skewness, we can tune the parameters of the CART model, such as the minimum number of observations required to attempt a split and the minimum bucket (leaf) size, and check the performance.
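To see why these two parameters matter on skewed data, here is a minimal sketch of a single CART-style split on one numeric feature, chosen by weighted Gini impurity. The names `gini`, `best_split`, `min_split`, and `min_bucket` are hypothetical and stand in for whatever the CART implementation in use calls them. A full tree would apply the same search recursively over all features.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(xs, ys, min_split=20, min_bucket=7):
    """Best (threshold, weighted_gini) on one feature, or None if no split is allowed.

    min_split:  minimum node size required before a split is attempted.
    min_bucket: minimum number of observations allowed in either child.
    """
    n = len(xs)
    if n < min_split:
        return None
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        if i < min_bucket or n - i < min_bucket:
            continue  # one child would be smaller than the bucket limit
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (i * gini(left) + (n - i) * gini(right)) / n
        if best is None or score < best[1]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (threshold, score)
    return best

# Toy data: 100 observations, only the 5 largest feature values are positive.
xs = list(range(100))
ys = [1 if x >= 95 else 0 for x in xs]
print(best_split(xs, ys, min_split=2, min_bucket=1))   # small buckets allowed
print(best_split(xs, ys, min_split=20, min_bucket=10)) # larger bucket limit
```

With a small bucket limit the split can isolate the five rare positives exactly. With a bucket limit of 10 that split is forbidden, and the best remaining child mixes positives with negatives. This is the trade-off to check when tuning: limits that are too small invite overfitting, while limits that are too large can hide the minority class.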
Classification accuracy is not a suitable performance measure for a skewed data set because it gives results biased towards the majority class. We can use other performance measures such as precision, recall, F-measure, ROC (AUC score), the Matthews correlation coefficient, and the Gini coefficient.
Example of a skewed data set:
- Total factors considered for stock picking: 100,000
- Do not influence stock picking: 97,900; influence stock picking: 2,100
Confusion matrix for the above example:

| | Actual Positive | Actual Negative |
| --- | --- | --- |
| Predicted Positive | True Positive (1770) | False Positive (150) |
| Predicted Negative | False Negative (330) | True Negative (97750) |
Calculated values for each measure based on the above data:

| Measure | Formula | Calculated value |
| --- | --- | --- |
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | 99.52% |
| Recall | TP / (TP + FN) | 84.29% |
| Specificity | TN / (TN + FP) | 99.85% |
| Precision | TP / (TP + FP) | 92.19% |
| F-Measure | 2 × (Recall × Precision) / (Recall + Precision) | 88.06% |
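The measures can be recomputed directly from the four confusion-matrix counts. The sketch below uses only the standard library. The function name `confusion_metrics` is hypothetical, and the Matthews correlation coefficient is included because it is mentioned as a suitable measure even though it does not appear in the table.

```python
import math

def confusion_metrics(tp, fp, fn, tn):
    """Return common binary-classification measures as fractions in [0, 1]
    (MCC lies in [-1, 1]). Hypothetical helper for illustration."""
    total = tp + fp + fn + tn
    recall = tp / (tp + fn)          # also called sensitivity
    precision = tp / (tp + fp)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f_measure": 2 * recall * precision / (recall + precision),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }

# Counts from the stock-picking example above.
metrics = confusion_metrics(tp=1770, fp=150, fn=330, tn=97750)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")

# Accuracy of a model that always predicts the majority class ("does not
# influence"): it is right on all 97,900 negatives and finds no positives.
majority_baseline_accuracy = 97900 / 100000
print(f"majority baseline accuracy: {majority_baseline_accuracy:.4f}")
```

The last line shows why accuracy alone misleads here: a model that never flags a single influencing factor still scores 97.9% accuracy, while its recall is zero.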
Precision-recall curves and ROC curves are good performance measures, and the Matthews correlation coefficient also works well with skewed data. Cross-validation further helps to give more reliable performance estimates. The advantage of keeping the whole ROC curve is that the optimal operating point can be chosen from it, rather than fixing an arbitrary cut-off point in advance.
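Choosing a cut-off from the whole ROC curve can be sketched as follows. The answer does not say which criterion defines the optimal point, so this example assumes Youden's J statistic (TPR minus FPR) as one common choice. The function names and the toy scores are hypothetical.

```python
def roc_points(scores, labels):
    """Return (threshold, fpr, tpr) for every distinct score.

    An observation is predicted positive when its score >= threshold.
    Labels are assumed to be 1 (positive) or 0 (negative).
    """
    pos = sum(1 for l in labels if l == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 0)
        points.append((t, fp / neg, tp / pos))
    return points

def best_cutoff(points):
    """Pick the ROC point that maximises Youden's J = TPR - FPR (an assumed criterion)."""
    return max(points, key=lambda p: p[2] - p[1])

# Toy predicted probabilities for 3 positives and 5 negatives.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
threshold, fpr, tpr = best_cutoff(roc_points(scores, labels))
print(threshold, fpr, tpr)
```

A different criterion, for example one that weights false negatives more heavily for business reasons, would pick a different point on the same curve, which is the benefit of keeping the whole curve.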
In practice, it is the combination of technical and domain knowledge that helps us bring out the best result.