Omitted Variable Bias (OVB)

You regressed house price on area and obtained a coefficient (β) for area.

You’d interpret it like this: on average, the house price is β higher for every unit increase in the area of the house.

To improve the model, you add the crime rate as a second predictor. Surprise! β drops drastically, and sometimes it can even turn negative.

You’ve just encountered Omitted Variable Bias (OVB).

It is the bias in your estimate of β that arises because you left a relevant variable out of the model.


But why does omitting a variable give a biased estimate? Which OLS assumption does it violate?

Comment!

In this blog I’ll also demonstrate OVB on simulated data.


I created two simulated datasets of the same size, each with the columns price (continuous), area (continuous), and crime_rate (binary).

Dataset 1

When I regress price on area alone, I get a β of 7639. So every unit increase in area is associated with an increase of 7639 units in the average house price.

When I also add crime (a binary variable) to the model, β does not change. It is still 7639. But high-crime areas (crime=1) are associated with a price that is 11410 lower on average.

If I add the interaction between crime and area and then plot both regression lines, this is what I would get.

But see how the intercept has changed: from 22940 to 28650, a change of roughly 6000.

In this dataset, removing the crime variable from the model did not bias β. Crime is an important variable (trust me, I created this dataset), but removing it only affected the intercept, not β.

One thing becomes clear: omitting an important variable is not, by itself, enough to produce OVB in β. Something else is needed.


Dataset 2

When I regress price on area alone, I get a β of 8561. So every unit increase in area is associated with an increase of 8561 units in the average house price.

But when I add crime to the model, β changes ‘significantly’ (the 95% CIs don’t overlap). β drops to 7861. Leaving crime out led to an overestimate of β by 700 units.

If I add the interaction between crime and area and then plot both regression lines, this is what I would get.

In this dataset, β was affected as well as the intercept. Why?


You can see that in the first dataset, the blue and orange slopes are the same. I simulated the dataset so that crime shifts the base price of the house but, holding crime constant, area affects the price identically. Mathematically, I set the correlation between area and crime to zero. That is why, even though crime is an important variable in explaining price, omitting it did not create OVB.
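The role of that correlation can be made explicit with the textbook OVB formula. What follows is a sketch in notation that does not appear in the post itself: γ is the coefficient on crime in the full model, δ is the slope you would get from regressing crime on area, and β̃ is the slope from the short model that leaves crime out.

```latex
\begin{aligned}
\text{full model:}\quad
  & \text{price} = \alpha + \beta\,\text{area} + \gamma\,\text{crime} + \varepsilon \\
\text{auxiliary regression:}\quad
  & \text{crime} = \pi_0 + \delta\,\text{area} + u,
  \qquad \delta = \frac{\operatorname{Cov}(\text{area},\,\text{crime})}{\operatorname{Var}(\text{area})} \\
\text{short model (large samples):}\quad
  & \tilde{\beta} \;\to\; \beta + \gamma\,\delta,
  \qquad \tilde{\alpha} \;\to\; \alpha + \gamma\,\pi_0
\end{aligned}
```

The bias in the slope is γδ, so it vanishes when crime does not matter for price (γ = 0) or when crime is uncorrelated with area (δ = 0). With δ = 0, the omitted effect lands entirely in the intercept, which is consistent with the first dataset: β stayed put while the intercept moved.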

In the second dataset, you can see that the blue and orange slopes are different. I simulated it so that high-crime areas have fewer big houses (and a lower base price as well). Mathematically, I gave area and crime a negative correlation. Hence, omitting the crime variable created OVB.
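Here is a minimal sketch of the two situations in plain Python. It is not the code or the data behind the numbers above: the data-generating process, the parameter values (30000, 7500, 11000, the noise level) and the `shrink` knob are all hypothetical, chosen only to mimic "crime independent of area" versus "smaller houses in high-crime areas". It also checks that the gap between the short and long slopes equals (coefficient on crime) × (slope of crime on area), which holds exactly in any sample.

```python
import random


def ols_1(x, y):
    """Intercept and slope from regressing y on a single predictor x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def ols_2(x, z, y):
    """Intercept and both slopes from regressing y on x and z."""
    n = len(x)
    mx, mz, my = sum(x) / n, sum(z) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    szz = sum((b - mz) ** 2 for b in z)
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    szy = sum((b - mz) * (c - my) for b, c in zip(z, y))
    det = sxx * szz - sxz ** 2  # solve the 2x2 normal equations
    bx = (szz * sxy - sxz * szy) / det
    bz = (sxx * szy - sxz * sxy) / det
    return my - bx * mx - bz * mz, bx, bz


def simulate(shrink, n=5000, seed=0):
    """Hypothetical data (assumed, not the post's datasets).

    shrink: how many area units smaller houses are in high-crime
    neighbourhoods; 0 makes area and crime independent.
    """
    rng = random.Random(seed)
    crime = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
    area = [rng.uniform(10, 35) - shrink * c for c in crime]
    price = [30000 + 7500 * a - 11000 * c + rng.gauss(0, 5000)
             for a, c in zip(area, crime)]
    return area, crime, price


def compare(shrink):
    """Short slope, long slope, crime coefficient, and slope of crime on area."""
    area, crime, price = simulate(shrink)
    _, b_short = ols_1(area, price)                # price ~ area
    _, b_long, g_long = ols_2(area, crime, price)  # price ~ area + crime
    _, delta = ols_1(area, crime)                  # crime ~ area (auxiliary)
    return b_short, b_long, g_long, delta


for label, shrink in [("uncorrelated", 0.0), ("correlated", 6.0)]:
    b_short, b_long, g_long, delta = compare(shrink)
    print(f"{label:>12}: short = {b_short:.1f}  long = {b_long:.1f}  "
          f"gap = {b_short - b_long:.1f}  gamma*delta = {g_long * delta:.1f}")
```

In the uncorrelated case the two slopes should be nearly identical; in the correlated case the short model should overstate the slope, because both the crime coefficient and the crime-on-area slope are negative.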

But won’t including crime create a multicollinearity issue? It will! But that is the lesser evil. When a variable is correlated with both the dependent variable and the other predictors, leaving it out gives you OVB, which means a biased estimate of β. Including it brings multicollinearity, which only makes the estimate of β less efficient (higher standard error), and that is something we can deal with.

Read these amazing notes for more clarity:
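To put a rough number on "less efficient": in a model with two predictors, the sampling variance of β is inflated by the variance inflation factor, 1 / (1 − r²), where r is the correlation between area and crime. The sketch below uses illustrative values of r that are not taken from the datasets above, and the function name is my own.

```python
import math


def variance_inflation_factor(r):
    """VIF for a two-predictor model, given the predictors' correlation r."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 1.0 / (1.0 - r * r)


# Illustrative correlations between area and crime (assumed values).
for r in (0.0, -0.3, -0.6, -0.9):
    vif = variance_inflation_factor(r)
    # The standard error scales with the square root of the VIF.
    print(f"r = {r:+.1f}  VIF = {vif:.2f}  SE multiplier = {math.sqrt(vif):.2f}")
```

The multiplier compares against predictors that are uncorrelated, holding the error variance fixed. Adding a variable that really explains price also shrinks the error variance, so the standard error of β does not necessarily go up in practice.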

OVB versus Multicollinearity | ARE Berkeley


Practically speaking, how would you ever know for sure that you didn’t miss an important variable? You can’t. So all our models are subject to OVB to some degree. It is the statistician’s job to make the model robust to OVB by including plausible confounders, sometimes even when they add little explanatory value.

Resources:

  1. What Is Omitted Variable Bias? | Marginal University
  2. OVB versus Multicollinearity | ARE Berkeley
  3. Omitted Variable Bias | Economic Theory Blog
