
Is it wise to use the CART technique when the dependent variable is skewed towards one of the classes?

Sarveshwaran Rajagopal

Data Scientist and Trainer (Gen AI & NLP) | Empowered 7000+ Professionals & Students to Excel in Data Science | Speaker and Thought Leader in Data Science

Published: December 10, 2018

CART is a non-parametric technique, so it can be applied to skewed (class-imbalanced) data. Before applying a CART model to a skewed data set, it is recommended to follow the CRISP-DM methodology: analyse the data set we are working on and understand the business problem at hand. Once the data have been interpreted, the imbalance needs to be treated. Techniques for handling a skewed data set include undersampling, oversampling, synthetic data generation, and cost-sensitive learning.
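As a rough illustration of one of the techniques listed above, the following is a minimal sketch of random oversampling in plain Python. The function name `random_oversample` and the toy data are assumptions made for this example and are not from the original article. Resampling of this kind is normally applied to the training portion only, so that the evaluation data keep the real class distribution.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until every
    class has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    # Every class is topped up to the size of the largest class.
    target = max(len(group) for group in by_class.values())

    out_rows, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy data (made up): 95 majority rows (label 0) and 5 minority rows (label 1).
rows = [[i] for i in range(100)]
labels = [0] * 95 + [1] * 5

balanced_rows, balanced_labels = random_oversample(rows, labels)
class_counts = Counter(balanced_labels)
print(class_counts)
```

Undersampling works the same way in reverse (randomly dropping majority rows), at the cost of discarding data.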

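After treating the data set, we use the CART model for classification. To prevent overfitting and handle the skewness, we can tune the parameters that control tree growth, such as the minimum number of observations required to split a node and the minimum number of observations allowed in a terminal node (bucket), and check how the performance changes.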
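To make the effect of such size limits concrete, here is a small sketch of a Gini-based split search on a single numeric feature. It is not a full CART implementation. The parameter names `min_split` and `min_bucket` are assumptions chosen for this example (they loosely mirror the `minsplit` and `minbucket` controls in R's rpart), and the data are made up.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(xs, ys, min_split=20, min_bucket=7):
    """Return the threshold on one numeric feature that minimises the
    weighted Gini impurity, or None if no split satisfies the size limits."""
    n = len(xs)
    if n < min_split:
        return None  # node is too small to be split at all
    pairs = sorted(zip(xs, ys))
    best_threshold, best_impurity = None, gini(ys)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        if i < min_bucket or n - i < min_bucket:
            continue  # one child would be smaller than min_bucket
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best_impurity:
            best_impurity = impurity
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
    return best_threshold


# Toy data (made up): 97 negatives followed by 3 positives.
xs = list(range(100))
ys = [0] * 97 + [1] * 3

loose = best_split(xs, ys, min_split=2, min_bucket=1)    # isolates the 3 positives
strict = best_split(xs, ys, min_split=20, min_bucket=7)  # leaf must hold >= 7 rows
print(loose, strict)
```

With loose limits the tree is free to carve out a leaf containing only the three minority rows, which is where overfitting tends to start. A larger bucket size forces the leaf to contain more observations.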

Classification accuracy is not a suitable performance measure for a skewed data set because it is dominated by the majority class. Other measures can be used instead, such as precision, recall, F-measure, ROC (AUC score), the Matthews correlation coefficient, and the Gini coefficient.

Example of a skewed data set:

Total number of factors considered for stock picking: 100,000
Do not influence stock picking: 97,900; influence stock picking: 2,100
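A quick arithmetic check, using only the class counts above, shows why accuracy alone can mislead here. The "model" in this sketch is a hypothetical baseline that always predicts the majority class.

```python
total = 100_000
no_influence = 97_900  # majority class
influence = 2_100      # minority class

# Baseline that labels every factor as "does not influence stock picking":
# all majority cases are right, all minority cases are missed.
baseline_accuracy = no_influence / total
baseline_recall = 0 / influence

print(f"accuracy = {baseline_accuracy:.2%}, minority recall = {baseline_recall:.2%}")
# prints "accuracy = 97.90%, minority recall = 0.00%"
```

A classifier therefore has to beat 97.9% accuracy before accuracy says anything at all, which is why the class-specific measures matter.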

Confusion Matrix for the above example:

| | Actual Positive | Actual Negative |
|---|---|---|
| Predicted Positive | True Positive (1,770) | False Positive (150) |
| Predicted Negative | False Negative (330) | True Negative (97,750) |

Calculated values for each measure based on the above data:

| Measure | Formula | Calculated value |
|---|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | 99.52% |
| Recall | TP / (TP + FN) | 84.29% |
| Specificity | TN / (TN + FP) | 99.85% |
| Precision | TP / (TP + FP) | 92.19% |
| F-Measure | 2 × (Recall × Precision) / (Recall + Precision) | 88.06% |
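The measures can be recomputed directly from the four confusion-matrix counts. The short sketch below does so in plain Python and also adds the Matthews correlation coefficient, which the article mentions but does not calculate. The MCC formula used is the standard one, and the variable names are just for this example.

```python
import math

# Counts taken from the confusion matrix in the example.
tp, fp, fn, tn = 1770, 150, 330, 97750

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
f_measure = 2 * precision * recall / (precision + recall)

# Matthews correlation coefficient: uses all four cells of the matrix.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

for name, value in [
    ("Accuracy", accuracy),
    ("Recall", recall),
    ("Specificity", specificity),
    ("Precision", precision),
    ("F-Measure", f_measure),
    ("MCC", mcc),
]:
    print(f"{name:<12}{value:.4f}")
```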

Precision-recall curves and ROC curves are good performance measures, and the Matthews correlation coefficient also works well with skewed data. Cross-validation further helps by giving a more reliable estimate of performance. The advantage of keeping the whole ROC curve is that the cut-off point can be chosen deliberately rather than arbitrarily.
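The idea of choosing a cut-off from the full ROC curve can be sketched as follows. The scores and labels are made up, and the selection rule (maximising Youden's J, that is TPR minus FPR) is an assumption for this example. The article does not name a criterion, and in practice the choice would usually reflect the business cost of each type of error.

```python
def roc_points(scores, labels):
    """Return (threshold, fpr, tpr) for each distinct score used as a
    cut-off, predicting positive when score >= threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


def best_cutoff(points):
    """Pick the ROC point that maximises Youden's J = TPR - FPR
    (one possible criterion, assumed here for illustration)."""
    return max(points, key=lambda p: p[2] - p[1])


# Made-up predicted probabilities and true labels (3 positives, 7 negatives).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

points = roc_points(scores, labels)
threshold, fpr, tpr = best_cutoff(points)
print(f"cut-off = {threshold}, FPR = {fpr:.3f}, TPR = {tpr:.3f}")
```

On this toy data the selected cut-off is well above the conventional 0.5, which illustrates why a default threshold is not always the right one for a skewed class distribution.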

In practice, it is the combination of technical and domain knowledge that helps us bring out the best result.
