Why is it so important to avoid data with high correlation? How can you solve that?
Vinícius Santos
Data Engineer | Data Scientist | Data Analytics | AWS and Databricks Certified
In general, there are good reasons to avoid highly correlated predictors. First, redundant predictors often add more complexity to the model than the information they contribute, which makes the trade-off a poor one.

Furthermore, in many scenarios, obtaining predictor data is costly in money or time, so a model that needs fewer variables is cheaper to build and maintain.

Even when money and time are not your problem, model quality probably will be, because there are concrete mathematical disadvantages to using correlated predictor data.
In techniques like linear regression, highly correlated predictors can produce highly unstable coefficient estimates, numerical errors, and very poor prediction performance.
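To see the instability concretely, here is a small sketch (the data and variable names are illustrative, not from any real dataset): the condition number of the normal-equations matrix X'X explodes when one predictor is nearly a copy of another, which is exactly what makes the least-squares solution numerically fragile.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# One predictor and a near-duplicate of it (correlation close to 1)
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
X_corr = np.column_stack([x1, x2])

# Two genuinely independent predictors for comparison
X_indep = np.column_stack([x1, rng.normal(size=n)])

# The condition number of X'X measures how much small perturbations in
# the data can change the least-squares solution: higher = less stable.
print(np.linalg.cond(X_corr.T @ X_corr))    # huge
print(np.linalg.cond(X_indep.T @ X_indep))  # close to 1
```

The correlated design is orders of magnitude worse conditioned, even though both matrices describe essentially the same amount of information.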
There are many theoretical approaches to understand and handle these highly correlated predictors, but I'll share with you a simple and effective method that I've used several times.
The algorithm is simple and can be implemented in any programming language.
The main idea is to remove first the predictors involved in the strongest pairwise correlations. Suppose we want to use a model that is sensitive to between-predictor correlations; we might apply a threshold of 0.75, meaning we aim to eliminate the minimum number of predictors so that all pairwise correlations fall below 0.75.
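The idea above can be sketched as a greedy filter. This is my own minimal implementation of the described approach, not necessarily the author's exact one; the function name, the 0.75 threshold, and the toy data are illustrative.

```python
import numpy as np

def drop_correlated(X, names, threshold=0.75):
    """Greedily drop columns of X (n_samples x n_features) until every
    pairwise absolute correlation is below `threshold`."""
    names = list(names)
    c = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(c, 0.0)  # ignore self-correlation
    dropped = []
    while c.size and c.max() >= threshold:
        # Locate the most correlated pair of remaining columns.
        i, j = np.unravel_index(np.argmax(c), c.shape)
        # Of the pair, remove the column whose average correlation with
        # all other columns is higher (the more redundant one).
        k = i if c[i].mean() >= c[j].mean() else j
        dropped.append(names.pop(k))
        c = np.delete(np.delete(c, k, axis=0), k, axis=1)
    return dropped

# Toy data: x2 is nearly a copy of x1; x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1,
                     x1 + rng.normal(scale=0.1, size=200),
                     rng.normal(size=200)])
print(drop_correlated(X, ["x1", "x2", "x3"]))  # drops one of the near-duplicate pair
```

Breaking ties by average correlation means the most broadly redundant predictor goes first, which tends to minimize the number of columns removed overall.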
Try building one model on the raw data, then a second one on the filtered data, and compare the results.
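One simple way to run that comparison, sketched here with synthetic data and plain least squares (names and numbers are illustrative): refit each model on bootstrap resamples and watch how much the coefficients move. The model with the redundant predictor swings far more.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # near-duplicate predictor
y = 3 * x1 + rng.normal(size=n)

def coef_spread(X, y, reps=200):
    """Std. deviation of OLS coefficients across bootstrap resamples."""
    coefs = []
    for _ in range(reps):
        idx = rng.integers(0, n, size=n)
        coefs.append(np.linalg.lstsq(X[idx], y[idx], rcond=None)[0])
    return np.std(coefs, axis=0)

X_both = np.column_stack([x1, x2])  # raw: keeps the redundant column
X_one = x1.reshape(-1, 1)           # filtered: redundant column removed
print(coef_spread(X_both, y))  # coefficients swing widely
print(coef_spread(X_one, y))   # stable estimate
```

Prediction error barely differs between the two, but the coefficient instability makes the unfiltered model much harder to interpret or trust.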
I hope this content helps you!