Clarifications on Regression Discontinuity Design (RDD)


Last week I came across an RDD study that used observational data to estimate the causal effect of air pollution on life expectancy. Here is the paper: New evidence on the impact of sustained exposure to air pollution on life expectancy from China’s Huai River Policy.

The key evidence is this: sustained exposure to an additional 10 μg/m³ of PM10 reduces life expectancy by 0.64 years.

This evidence is used to build the AQLI (Air Quality Life Index) tool by the Energy Policy Institute at the University of Chicago (EPIC).

You should check out this tool to appreciate how statistical evidence can be put to use.


Reading the paper made me curious to learn more about RDD. I’m not going to write about how to do an RDD; this blog by Arun Subramanian did a thorough job of that. But I had a few questions about RDD, and this post is a write-up of the answers I found.

As a preface for those who don’t know RDD: it is a method for inferring causality from observational data (without conducting an experiment). We assign data points to a treatment group or a control group based on a threshold on a “running variable” (also called a forcing variable). We then fit two regressions, one on each group, and compare their predictions right at the threshold. If there is a jump (a discontinuity), we infer causality, and the size of the jump is the causal estimate.

Picture from Arun's blog

For a running example, consider the one most often used for RDD. Imagine a class of 100 students who took a test (pre-treatment). Students scored between 0–100 [RUNNING VARIABLE]. The teacher wants to help the under-performing kids, so she arranged tuitions [TREATMENT] for those who scored less than 40 [THRESHOLD]. After a month of tuitions, the class takes another test. In RDD, we ask: were the tuitions helpful?
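To make this concrete, here is a minimal simulation of the class example in Python (the +5 point tuition effect, the noise level, and every other number here are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Pre-treatment test score: the running variable, between 0 and 100.
pre_score = rng.uniform(0, 100, n)

# Treatment rule: tuitions for everyone who scored below the threshold.
threshold = 40
treated = pre_score < threshold

# Post-treatment score: driven by ability (proxied by pre_score), plus a
# +5 point boost from tuitions -- the true causal effect in this simulation.
post_score = 20 + 0.7 * pre_score + 5 * treated + rng.normal(0, 3, n)
```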


Q 1A) Why can’t we just compare means of the treatment and control groups to find the Average Treatment Effect?

Why can’t we just compare the mean post-treatment score of those who scored less than 40 in the pre-treatment test with that of everyone else?

When we conduct an experiment, it is sufficient to compare the mean outcomes of the treatment and control groups. If the means differ, we can infer causality, because randomization makes the groups comparable.

But in RDD, we are not actually conducting an experiment. There is no randomization; the treatment is assigned by a rule. Hence, comparing means is problematic.

The control group contains all the high scorers (≥40), who would score high in the post-treatment test as well. The control group mean would therefore be higher than the treatment group mean, even if the treatment works.

Q 1B) Then why can’t we just compare means of a bandwidth region around the threshold?

In RDD, we go with the assumption that the groups just before and after the threshold are comparable. So why can’t I compare the post-treatment scores of those who scored 35–40 in the pre-treatment test with those who scored 40–45? That would overcome the problem of the extremely high scorers in the control group.

Yes, but there is still another issue with comparing means.

It is possible that those who scored less in the pre-treatment test score more now out of guilt, and not necessarily due to tuitions. So those who scored 35–40 could have a higher mean than those who scored 40–45, even without treatment. How do we avoid wrongly attributing causality to the treatment in that case? — By doing regression!

Consider the case where the treatment did not actually take place. The 35–40 group scores more now out of guilt (their mean is higher). But when we draw regression lines for both the 35–40 and 40–45 groups, we see that there is no jump (discontinuity) at the threshold: if guilt grows smoothly as the pre-treatment score falls, the regression absorbs it into the slope rather than showing a break at the cutoff.

Hence, we cannot just compare means and claim causality. Instead, we regress the outcome on the variable that determines treatment (the pre-treatment score here), which eliminates this bias.

Nevertheless, comparing means is a simple first step before getting into regressions. Just don’t claim causality from it.
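Continuing the simulation above, here is a sketch of both approaches: the naive comparison of means is badly biased, while a regression that controls for the running variable recovers the true +5 effect at the threshold (the interacted specification below is my choice for illustration):

```python
import statsmodels.api as sm

# Naive comparison of means: treated students score *lower* on average,
# simply because they were low scorers to begin with.
naive_diff = post_score[treated].mean() - post_score[~treated].mean()
print(f"Naive difference in means: {naive_diff:.2f}")  # large and negative

# RDD regression: outcome on treatment, the centered running variable, and
# their interaction (allowing a separate slope on each side of the cutoff).
centered = pre_score - threshold
X = sm.add_constant(np.column_stack([treated, centered, treated * centered]))
fit = sm.OLS(post_score, X).fit()
print(f"RDD estimate at the threshold: {fit.params[1]:.2f}")  # close to +5
```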


Q2) Why go for Weighted Least Squares (WLS) Regression or non-parametric methods like Local Linear Regression (LLR)?

In the above question, we learnt why regressions have to be done. But it doesn’t stop there: in RDD, people use WLS or LLR for more robust causal estimates. Why?

Q1A has part of the explanation. We want the regression model to predict the value at the threshold, but points far from the threshold can pull the OLS fit around and give inaccurate causal estimates. A WLS regression (with more weight on points close to the threshold) helps.

A bigger problem is that non-linearity can be mistaken for a discontinuity, and thus for causality. Consider the following figure: the OLS regression lines (solid) show a jump (discontinuity), and we would infer causality. But if we fit a WLS or an LLR, we would not see those jumps (dashed line).

Source: Mostly Harmless Econometrics

Non-parametric methods are also favored because OLS regression estimates are more influenced by omitted variable bias.
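Here is a minimal sketch of the WLS version on the same simulated data, using a triangular kernel so that observations far from the threshold get little or no weight (the bandwidth of 15 is an arbitrary choice for illustration; in practice it is chosen by a data-driven procedure):

```python
# Triangular kernel weights: 1 at the threshold, falling linearly to 0 at
# the edge of the bandwidth window; points outside it get weight 0.
bandwidth = 15
weights = np.maximum(0, 1 - np.abs(centered) / bandwidth)

# Same interacted specification as before, but weighted: this is local
# linear estimation at the cutoff, so global non-linearity far from the
# threshold can no longer masquerade as a jump.
wls_fit = sm.WLS(post_score, X, weights=weights).fit()
print(f"Local (WLS) estimate at the threshold: {wls_fit.params[1]:.2f}")
```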


Q3) Are other covariates needed in the regression?

RDD by itself lets you make a causal inference with the running variable alone. But you may want to include more covariates for other reasons:

  1. Precision: Including more variables increases the precision of the regression estimates (smaller standard errors), and hence gives tighter causal estimates (see the sketch after this list).
  2. Non-linearities: Non-linearities in the fit can be absorbed by adding more variables (useful when you want to stick with OLS).
  3. Robustness check: RDD rests on the assumption that no other variable jumps at the threshold. If some other variable also jumps, the causality could be attributed to it instead. Adding all relevant covariates makes the RDD estimate more robust for causal inference.
  4. External validity: A model with covariates can be extended to other populations as well.
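As a sketch of point 1, assume a hypothetical pre-treatment covariate (say, attendance) that also predicts the outcome; including it soaks up residual variance and shrinks the standard error on the RDD term (again, all numbers are invented for illustration):

```python
# Hypothetical covariate, and an outcome that also depends on it.
attendance = rng.uniform(0.5, 1.0, n)
post_with_cov = post_score + 10 * attendance

# Fit with and without the covariate; compare the standard error on the
# treatment coefficient (index 1).
X_cov = sm.add_constant(
    np.column_stack([treated, centered, treated * centered, attendance]))
fit_without = sm.OLS(post_with_cov, X).fit()
fit_with = sm.OLS(post_with_cov, X_cov).fit()
print(f"SE without covariate: {fit_without.bse[1]:.2f}")
print(f"SE with covariate:    {fit_with.bse[1]:.2f}")  # smaller
```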


Q4) Why do people rarely do residual analysis while performing RDD?

In regression, residual analysis is important for checking that the regression assumptions hold. But people who do RDD rarely do much residual analysis.

This is because we are not interested in predicting individual outcome values at the threshold; we only need to predict the average outcome there. For that, a robust standard error serves the purpose, and assumptions like normality of errors are not important.
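In statsmodels, for instance, this amounts to requesting heteroskedasticity-robust standard errors rather than inspecting residual plots (a sketch, reusing the fit from earlier):

```python
# Re-fit the RDD regression with heteroskedasticity-robust (HC1) standard
# errors; no normality-of-residuals check is needed for inference on the
# average jump at the threshold.
robust_fit = sm.OLS(post_score, X).fit(cov_type="HC1")
print(f"RDD estimate: {robust_fit.params[1]:.2f} "
      f"(robust SE: {robust_fit.bse[1]:.2f})")
```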


These are a few clarifications I sought out in my RDD learning. If you have more questions, please shoot them to me and I’ll write about them!

