Why is it so important to avoid data with high correlation? How can you solve that?

In general, there are good reasons to avoid highly correlated predictors. First, redundant predictors often add more complexity to the model than information, which is a poor trade-off.

Furthermore, in many scenarios, obtaining predictor data is costly in money or time, so every variable you can drop without losing information reduces that cost.

Even if money and time are not a concern, the quality of the model will probably suffer, because correlated predictors also bring mathematical disadvantages.

Using highly correlated predictors in techniques like linear regression can result in highly unstable models, numerical errors, and very poor prediction performance.
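
To make the instability concrete, here is a minimal sketch in Python with NumPy. It is only an illustration on synthetic data, not part of the method itself: the function name `coef_spread`, the noise levels, and the sample sizes are all assumptions chosen for the demo. It fits ordinary least squares many times and measures how much the coefficient of the first predictor moves from one sample to the next.

```python
import numpy as np

rng = np.random.default_rng(0)

def coef_spread(noise_scale, n_rows=200, n_repeats=200):
    """Standard deviation of the x1 coefficient across repeated OLS fits.

    Hypothetical helper for this demo. A small noise_scale makes x2
    almost a copy of x1, i.e. the two predictors are highly correlated.
    """
    coefs = []
    for _ in range(n_repeats):
        x1 = rng.normal(size=n_rows)
        x2 = x1 + noise_scale * rng.normal(size=n_rows)
        # True relationship: both coefficients are exactly 1
        y = x1 + x2 + rng.normal(size=n_rows)
        X = np.column_stack([np.ones(n_rows), x1, x2])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        coefs.append(beta[1])
    return float(np.std(coefs))

spread_high_corr = coef_spread(noise_scale=0.01)  # x1 and x2 nearly identical
spread_low_corr = coef_spread(noise_scale=1.0)    # correlation around 0.7

print("coefficient spread, highly correlated:", spread_high_corr)
print("coefficient spread, moderately correlated:", spread_low_corr)
```

In both settings the true coefficient is 1, but with nearly identical predictors the estimate swings far more widely between samples, which is the kind of instability described above.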

There are many theoretical approaches to understand and handle these highly correlated predictors, but I'll share with you a simple and effective method that I've used several times.

The algorithm is as follows and can be implemented in any programming language:

  1. Calculate the correlation matrix of the predictors
  2. Determine the two predictors associated with the largest absolute pairwise correlation (call them predictors A and B)
  3. Determine the average absolute correlation between A and all the other variables in your dataset, then do the same for B
  4. If A has the larger average correlation, remove it; otherwise, remove predictor B
  5. Repeat steps 2 to 4 until no absolute correlations are above the threshold

The main idea is to first remove the predictors that are most strongly correlated with the others. If we wanted to use a model that is sensitive to between-predictor correlations, we might apply a threshold of 0.75. This means that we aim to remove as few predictors as possible while bringing every absolute pairwise correlation below 0.75.
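
The five steps can be sketched in Python with NumPy as follows. Treat it as one possible reading of the algorithm rather than a reference implementation: the function name `drop_correlated`, the synthetic columns `x1` to `x4`, and the tie-breaking rule (remove B when the averages are equal) are assumptions I made for illustration. It also assumes no predictor is constant, since a constant column has no defined correlation.

```python
import numpy as np

def drop_correlated(X, names, threshold=0.75):
    """Return the predictor names kept after the iterative filter.

    X is a 2-D array with one column per predictor; names labels the columns.
    """
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        # Step 1: absolute correlation matrix of the remaining predictors
        corr = np.abs(np.corrcoef(X[:, keep], rowvar=False))
        np.fill_diagonal(corr, 0.0)  # ignore each predictor's self-correlation
        # Step 2: the pair (A, B) with the largest absolute correlation
        a, b = np.unravel_index(np.argmax(corr), corr.shape)
        if corr[a, b] <= threshold:
            break  # Step 5: nothing left above the threshold
        # Step 3: average absolute correlation of A and of B with the others
        # (the diagonal is zero, so divide by the number of other predictors)
        n_others = len(keep) - 1
        mean_a = corr[a].sum() / n_others
        mean_b = corr[b].sum() / n_others
        # Step 4: remove whichever of the two has the larger average
        drop = a if mean_a > mean_b else b
        del keep[int(drop)]
    return [names[i] for i in keep]

# Synthetic demo data: x1, x2 and x4 are strongly related, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)
x3 = rng.normal(size=500)
x4 = x1 + 0.5 * rng.normal(size=500)
data = np.column_stack([x1, x2, x3, x4])
names = ["x1", "x2", "x3", "x4"]

kept = drop_correlated(data, names, threshold=0.75)
print(kept)
```

On this data the filter should keep the independent `x3` plus a single representative of the correlated group. Which of `x1`, `x2`, or `x4` survives depends on small differences in the sample correlations.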

Try building one model on the unfiltered data and a second one on the filtered data, then compare the results.

I hope that this content can help you!
