
Omitted Variable Bias (OVB)

Sai Krishna Dammalapati

Civic Technology | Statistics | Data | Science

Published: November 23, 2024

You performed a regression between house prices and area and obtained a coefficient (β) for area.

You’d interpret it like this: on average, house prices are β higher for every unit increase in the area of the house.

To improve the model, you also add the crime rate. Surprise! β drops drastically, and sometimes it can even turn negative.

You’ve just encountered the Omitted Variable Bias (OVB).

It is the bias in your estimate of β that arises because you left a relevant variable out of the model.

But why does omitting a variable give a biased estimate? Which OLS assumption does it violate?

Comment!

In this blog, I’ll also demonstrate OVB.

I created two simulated datasets of the same size with the columns price (continuous), area (continuous), and crime_rate (binary).

Dataset 1

When I regress price on area alone, I get a β of 7639. So every unit increase in area is associated with an increase of 7639 units in the average house price.

When I also add the crime rate (a binary variable) to the model, β does not change. It is still 7639. But high-crime areas (crime=1) are associated with prices that are 11410 lower on average.

If I add the interaction between crime and area and then plot both regression lines, this is what I would get.

But see how the intercept has changed: from 22940 to 28650, a change of roughly 6000.

In this dataset, removing the crime variable from the model did not bias β. Crime is an important variable (trust me, I created this dataset). But removing it only affected my intercept, not β.
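A minimal sketch of a Dataset 1-style simulation, for readers who want to reproduce the pattern. The original data and its generating parameters are not given in this post, so every number below (sample size, coefficients, noise level) is a hypothetical choice, as is the helper `ols`. The only property carried over from the text is that crime is drawn independently of area.

```python
import numpy as np

def ols(columns, y):
    """Return OLS coefficients [intercept, slope_1, ...] via least squares."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
n = 5000  # hypothetical sample size

# Hypothetical data-generating process, NOT the article's actual dataset.
area = rng.uniform(50, 200, n)        # continuous predictor
crime = rng.binomial(1, 0.5, n)       # binary, drawn independently of area
price = 30000 + 7500 * area - 11000 * crime + rng.normal(0, 20000, n)

short = ols([area], price)            # crime omitted
long_ = ols([area, crime], price)     # crime included

print("slope on area, crime omitted: ", round(short[1], 1))
print("slope on area, crime included:", round(long_[1], 1))
print("intercept, omitted vs included:", round(short[0]), round(long_[0]))
```

Under these assumptions the two slopes should come out nearly identical, while the intercept of the short model absorbs the average effect of crime.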


One thing becomes clear. Omitting an important variable does not, by itself, produce OVB in β. Something else is needed.

Dataset 2

When I regress price on area alone, I get a β of 8561. So every unit increase in area is associated with an increase of 8561 units in the average house price.

But when I added crime to the model, β changed ‘significantly’ (the 95% CIs don’t overlap): it dropped to 7861. Omitting crime led to an overestimate of β by 700 units.
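A minimal sketch of a Dataset 2-style simulation. Again, the parameters and the helper `ols` are hypothetical and not taken from the original data. The one deliberate difference from an independent design is that high-crime observations get smaller houses on average, so area and crime are negatively correlated. For simplicity, this sketch keeps a single common slope for area and does not model any interaction.

```python
import numpy as np

def ols(columns, y):
    """Return OLS coefficients [intercept, slope_1, ...] via least squares."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(1)
n = 5000  # hypothetical sample size

# Hypothetical data-generating process, NOT the article's actual dataset.
crime = rng.binomial(1, 0.5, n)
area = rng.normal(140, 30, n) - 40 * crime   # high crime -> smaller houses
price = 30000 + 7500 * area - 11000 * crime + rng.normal(0, 20000, n)

short = ols([area], price)           # crime omitted
long_ = ols([area, crime], price)    # crime included
delta = ols([area], crime)[1]        # slope from regressing crime on area

gap = short[1] - long_[1]            # how much the short model overstates the slope
print("slope on area, crime omitted: ", round(short[1], 1))
print("slope on area, crime included:", round(long_[1], 1))
print("gap:", round(gap, 2), "| crime coefficient * delta:", round(long_[2] * delta, 2))
```

The last line checks a standard OLS identity: within a sample, the gap between the two slopes equals the estimated crime coefficient times the slope from regressing crime on area. Both factors are negative here, so the short model overstates β.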

If I add the interaction between crime and area and then plot both regression lines, this is what I would get.

Apart from the intercept, even β got impacted in this dataset. Why?

In the first dataset, you can see that the blue and orange slopes are the same. I simulated the data so that crime affects only the base price of the house; holding crime constant, area affects price identically. Mathematically, I set the correlation between area and crime to zero. That is why, even though crime is an important variable in explaining price, omitting it did not create OVB.

In the second dataset, you can see that the blue and orange slopes are different. I simulated it so that high-crime areas have fewer big houses (the base rate is also low). Mathematically, I gave area and crime a negative correlation. Hence, omitting the crime variable created OVB.

But won’t including crime create a multicollinearity issue? It will! But that is the lesser devil to deal with. When an omitted variable correlates with both the dependent variable and an included predictor, you are facing OVB, which gives a biased estimate of β. Multicollinearity only makes the estimate of β less efficient (higher standard error), and we can deal with that.
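The role of the correlation can be made explicit with the textbook omitted-variable formula. The symbols γ (the coefficient on crime), δ (the slope from regressing crime on area) and β̃ (the slope from the model without crime) are notation introduced here for illustration; the formula assumes a common slope for area and no interaction term.

```latex
\text{price} = \alpha + \beta\,\text{area} + \gamma\,\text{crime} + \varepsilon

\tilde{\beta} \;\longrightarrow\; \beta + \gamma\,\delta,
\qquad
\delta = \frac{\operatorname{Cov}(\text{area},\,\text{crime})}{\operatorname{Var}(\text{area})}
```

With zero correlation between area and crime, δ = 0 and the bias term γδ vanishes, whatever the size of γ. With γ < 0 (crime lowers price) and δ < 0 (high-crime areas have smaller houses), γδ is positive, which matches the overestimate of β seen in the second dataset.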
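One common way to gauge the efficiency cost of correlated predictors is the variance inflation factor (VIF), a standard diagnostic that this post does not itself use. The sketch below is an illustration under assumed parameters: `vif_two_predictors` and `simulate_predictors` are hypothetical helpers, and the data are simulated rather than taken from the datasets above. With exactly two predictors, the VIF is 1 / (1 − r²), where r is their correlation.

```python
import numpy as np

def vif_two_predictors(x, z):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    r = np.corrcoef(x, z)[0, 1]
    return 1.0 / (1.0 - r ** 2)

def simulate_predictors(shift, seed, n=5000):
    """Hypothetical predictors; `shift` controls how much smaller houses are in high-crime areas."""
    rng = np.random.default_rng(seed)
    crime = rng.binomial(1, 0.5, n)
    area = rng.normal(140, 30, n) - shift * crime
    return area, crime

area_indep, crime_indep = simulate_predictors(shift=0, seed=2)   # uncorrelated design
area_corr, crime_corr = simulate_predictors(shift=40, seed=2)    # negatively correlated design

vif_indep = vif_two_predictors(area_indep, crime_indep)
vif_corr = vif_two_predictors(area_corr, crime_corr)

print("VIF, independent predictors:", round(vif_indep, 3))
print("VIF, correlated predictors: ", round(vif_corr, 3))
```

A VIF near 1 means the correlation costs almost nothing in precision. A larger value means the variance of the β estimate is inflated by that factor relative to an uncorrelated design, but the estimate itself stays unbiased.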

Read these amazing notes for more clarity

OVB versus Multicollinearity | ARE Berkeley

Practically speaking, how would you ever know for sure that you didn’t miss an important variable? It’s impossible. So all our models are subject to OVB. It is the statistician’s duty to make the model robust to OVB by adding different variables (sometimes even if they don’t add any explanatory value).


