Why is it so important to avoid data with high correlation? How can you solve that?
Vinícius Santos
Data Engineer | Data Scientist | Data Analytics | AWS and Databricks Certified
In general, there are good reasons to avoid highly correlated predictors. First, redundant predictors often add more complexity to the model than the information they contribute, which makes the trade-off a poor one.

Furthermore, in many scenarios, obtaining predictor data is costly in money or time, so a model that needs fewer variables is cheaper to build and maintain.

Even when money and time are not your problem, model quality probably will be, because there are concrete mathematical disadvantages to using correlated predictor data.
In techniques like linear regression, highly correlated predictors can produce highly unstable coefficient estimates, numerical errors, and very poor prediction performance.
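To see the instability concretely, here is a small sketch (the data and variable names are illustrative, not from any real dataset): the condition number of the normal-equations matrix X'X explodes when one predictor is nearly a copy of another, which is exactly what makes the least-squares solution numerically fragile.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# One predictor and a near-duplicate of it (correlation close to 1)
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
X_corr = np.column_stack([x1, x2])

# Two genuinely independent predictors for comparison
X_indep = np.column_stack([x1, rng.normal(size=n)])

# The condition number of X'X measures how much small perturbations in
# the data can change the least-squares solution: higher = less stable.
print(np.linalg.cond(X_corr.T @ X_corr))    # huge
print(np.linalg.cond(X_indep.T @ X_indep))  # close to 1
```

The correlated design is orders of magnitude worse conditioned, even though both matrices describe essentially the same amount of information.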
There are many theoretical approaches to understand and handle these highly correlated predictors, but I'll share with you a simple and effective method that I've used several times.
The algorithm is simple and can be implemented in any programming language.
The main idea is to remove first the predictors involved in the strongest pairwise correlations. Suppose we want to use a model that is sensitive to between-predictor correlations; we might apply a threshold of 0.75, meaning we aim to eliminate the minimum number of predictors so that all pairwise correlations fall below 0.75.
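The idea above can be sketched as a greedy filter. This is my own minimal implementation of the described approach, not necessarily the author's exact one; the function name, the 0.75 threshold, and the toy data are illustrative.

```python
import numpy as np

def drop_correlated(X, names, threshold=0.75):
    """Greedily drop columns of X (n_samples x n_features) until every
    pairwise absolute correlation is below `threshold`."""
    names = list(names)
    c = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(c, 0.0)  # ignore self-correlation
    dropped = []
    while c.size and c.max() >= threshold:
        # Locate the most correlated pair of remaining columns.
        i, j = np.unravel_index(np.argmax(c), c.shape)
        # Of the pair, remove the column whose average correlation with
        # all other columns is higher (the more redundant one).
        k = i if c[i].mean() >= c[j].mean() else j
        dropped.append(names.pop(k))
        c = np.delete(np.delete(c, k, axis=0), k, axis=1)
    return dropped

# Toy data: x2 is nearly a copy of x1; x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1,
                     x1 + rng.normal(scale=0.1, size=200),
                     rng.normal(size=200)])
print(drop_correlated(X, ["x1", "x2", "x3"]))  # drops one of the near-duplicate pair
```

Breaking ties by average correlation means the most broadly redundant predictor goes first, which tends to minimize the number of columns removed overall.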
Try building one model on the raw data, then a second one on the filtered data, and compare the results.
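One simple way to run that comparison, sketched here with synthetic data and plain least squares (names and numbers are illustrative): refit each model on bootstrap resamples and watch how much the coefficients move. The model with the redundant predictor swings far more.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # near-duplicate predictor
y = 3 * x1 + rng.normal(size=n)

def coef_spread(X, y, reps=200):
    """Std. deviation of OLS coefficients across bootstrap resamples."""
    coefs = []
    for _ in range(reps):
        idx = rng.integers(0, n, size=n)
        coefs.append(np.linalg.lstsq(X[idx], y[idx], rcond=None)[0])
    return np.std(coefs, axis=0)

X_both = np.column_stack([x1, x2])  # raw: keeps the redundant column
X_one = x1.reshape(-1, 1)           # filtered: redundant column removed
print(coef_spread(X_both, y))  # coefficients swing widely
print(coef_spread(X_one, y))   # stable estimate
```

Prediction error barely differs between the two, but the coefficient instability makes the unfiltered model much harder to interpret or trust.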
I hope this content helps you!