* Regression discontinuity is a method of analysis dating back to Thistlethwaite and Campbell (1960) but recently popularized by a number of important papers such as Hahn, Todd, and Van der Klaauw (2001) <http://ideas.repec.org/a/ecm/emetrp/v69y2001i1p201-09.html>
* The method is argued to require weaker assumptions than natural experiments.
* The method is deceptively simple.
* Imagine there is some rule that assigns a treatment whenever a variable z, which varies across the population, crosses a cutoff c.
* z may be correlated with the outcome variable y of interest. However, if one looks only at individuals whose value of z is sufficiently close to c, the only remaining systematic difference between them should be whether or not they received the treatment T.
* Let's see how this works.
set obs 20000
set seed 101
* Imagine that a school has 20000 incoming students.
* They have SAT scores drawn from a uniform distribution (not a realistic assumption)
gen SAT = 600 + int(181*uniform())*10
* As an administrator you would like to give out merit based scholarships to encourage students to do well at your school.
* We will set the scholarship cutoff at a score of 2130.
recode SAT (min/2129=0) (2130/max=1), gen(scholarship)
* There is also some measurable level of mentoring that affects performance which is independent of SAT scores and scholarship.
gen mentoring = rbinomial(1,.5)
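* As a quick optional check (these commands are illustrative additions, not part of the original exercise), we can confirm that mentoring is unrelated to SAT scores and scholarship status:
corr mentoring SAT
tab mentoring scholarship, chi2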
* Let's also imagine that students with top SAT scores are more likely to do well without the scholarship.
gen performance = 25 + 2*(SAT/1500)^3 + 1*scholarship + 1*mentoring + rnormal()*5
twoway (scatter performance SAT , msize(tiny) msymbol(circle)) ///
       (lfit performance SAT if SAT < 2130)                    ///
       (lfit performance SAT if SAT >= 2130),                  ///
       legend(label(2 "No Scholarship") label(3 "Scholarship"))
* Performance is some index that your team has developed that combines grades, time to completion, post graduation job success, entry to graduate schools, as well as alumni contributions.
* You think that the relationship between SAT and performance is nonlinear
* Specification 0:
reg performance SAT scholarship mentoring
* However, you are suspicious of the relationship between students receiving the scholarship and future success.
* Thus RD! You look instead at those students who almost got the scholarship and those who just barely qualified for the scholarship.
* Specification 1:
reg performance SAT scholarship mentoring if SAT > 1930 & SAT < 2330
* We can see that our estimate is closer to the true effect. It is still imperfect, however, and restricting the data further does not help.
* Specification 2:
reg performance SAT scholarship mentoring if SAT > 2070 & SAT < 2180
* Specification 3:
reg performance SAT scholarship mentoring if SAT > 2100 & SAT < 2160
* This is because we rapidly lose observations as we narrow the window around the cutoff.
* Specification 4:
reg performance SAT scholarship mentoring if SAT > 2115 & SAT < 2145
* It is easy to see the problem with the RD approach in this example. RD is highly sensitive to the choice of estimation window.
* If the window is too narrow, we do not have enough data to estimate the effect, or the confidence interval, precisely enough for useful analysis.
* In Specification 0 we can see that our confidence interval does not enclose the true parameter value, while in all other specifications it does.
* However, from Specifications 2-4 we might be forced to conclude that, after gaining RD's "precision", the scholarship had no effect significantly different from zero.
* We can see that the estimated effect of mentoring suffers just as badly as the estimated effect of the scholarship when we restrict the sample.
* Ultimately, using RD to restrict the sample has an equally deleterious effect on the other coefficients of interest.
* This might sound discouraging. However, once sample sizes are large enough, things start to improve.
* Let's imagine that the administrator is able to use multiple years of students to estimate the effect of the program.
* Increase the sample size to two or three times its current level, include a year fixed effect, and the RD estimator will still outperform the biased full-sample estimator, whose bias does not shrink with the inclusion of more data.
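* As a sketch of that multi-year exercise (this code is an illustrative extension, not part of the original; the cohort structure and window are assumptions), we can pool three simulated cohorts and add year fixed effects:
clear
set obs 60000
set seed 101
gen year = ceil(_n/20000)
gen SAT = 600 + int(181*uniform())*10
gen scholarship = SAT >= 2130
gen mentoring = rbinomial(1,.5)
gen performance = 25 + 2*(SAT/1500)^3 + 1*scholarship + 1*mentoring + rnormal()*5
* The same narrow-window RD regression as Specification 2, now with three times the data and year dummies
reg performance SAT scholarship mentoring i.year if SAT > 2070 & SAT < 2180
* With the larger pooled sample, the narrow window retains enough observations for the scholarship coefficient to be estimated with useful precision.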