Monday, February 11, 2013

Non-Parametric Regression Discontinuity

* I went to an interesting seminar today by Matias Cattaneo from the University of Michigan.

* He presented some of his work on non-parametric regression discontinuity (RD) design.

* The work and the paper's conclusions were interesting, but even more interesting was the release by him and his coauthors of a Stata package that implements RD design for easy use.

* net install rdrobust, from( replace

* Regression discontinuity is a technique which allows identification of a localized effect around a natural or structured policy discontinuity.

* For instance, if you are wondering what effect federal grants have on college attendance, then you may be concerned that simply contrasting students who are eligible for federal grants with those who are not will be problematic, because students who are eligible (low income) may be different from those who are not eligible for the grant (not low income).

* The RD argument is that if individuals do not, in response to the grant being available, move their reported income level to become eligible for the grant, then those just below the cut off for the grant and those just above it will be fundamentally very similar.

* This may occur if for instance the income cut-off for the grant is unknown.

* So even if students are systematically under-reporting their income, they are doing so without knowing the actual cut off, so the students sufficiently close above and below the cut off are arguably the "same", or drawn from the same pool, except that one group received the program and the other did not.

* The previous post deals in part with the case where we assume a linear structure for the underlying characteristics.

* However, the potentially more interesting case is when we assume that our dependent variable responds nonlinearly to income.

* But before going there let's think about what this method boils down to.

* Like all identification methods in statistics or econometrics when we do not have experimental data, identification of an effect is driven by some exogeneity argument.

* That is, x causes y and is unrelated to u (the error).  In the case when x may be correlated with the error, the use of an exogenous variable to force movement in the variable of interest may be sufficient to identify a causal effect.

* In this case, clearly it is not enough to simply see what the average y response (GPA, attendance, graduation rates, whatever) is to a change in grant level because those who receive the grants are systematically different from those who do not.

* However, because the cut off for receiving the grant is unknown, around the cut off the two samples who receive the grant and who do not can arguably be considered the same.

* Thus, we could say that the unknown position of the cut off is the random exogenous variable which near the cut off forces some students into the group that receives the grant and some students into the group that does not.

* Let's imagine some non-parametric relationship between income and performance:


clear
set obs 10000

gen income = 3^((runiform()-.75)*4)
  label var income "Reported Income"

  sum income
gen perf0 = ln(income) + sin((income-r(min))/r(max)*4*_pi)/3 + 3
  label var perf0 "Performance Index - Base"

scatter perf0 income

* Looks pretty non-parametric

* Let's add in some random noise
gen perf1 = perf0 + rnormal()*.5
  label var perf1 "Performance Index - with noise"

scatter  perf1 income

* Using the user written command rcspline, we can see the local average performance as a function of income.

* ssc install rcspline

rcspline perf1 income,  nknots(7) showknots title(Cubic Spline)
* I specify "7" knots, which is the maximum allowed in the rcspline command.

* The spline seems to fit the generated data well.

* Now let's add a discontinuity at .5.

gen grant = income<.5
sum grant

* So roughly 59% of our sample is eligible for the grant.
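
* The RD argument relies on students not bunching just below the cut off.  A rough visual check, sketched here under the assumption of a .3 to .7 window and .02 wide bins (both arbitrary), is a histogram of income around the cut off.

histogram income if inrange(income, .3, .7), width(.02) start(.3) xline(.5)

* In simulated data there is nothing to find, but with real data a spike in the bars just below .5 would be a warning sign.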

* Now let's add the grant effect.

* First let's generate an income variable that is centered at the grant cut point.
gen income_center = income-.5

gen perf2 = perf1 + .5*grant - .1*income_center*grant
  * Thus the grant is more effective for students with lower income.
  label var perf2 "Observed Performance"

**** Simulation done: Estimation Start ****

rcspline perf2 income,  knots(.15 .25 .35 .37 .4 .45 .5 .55 .6 .65 .75 .85 1.1 1.25 1.5) title(Cubic Spline)
* This is obviously not the ideal plot and I have had some difficulty finding a command which will generate the plot that I would like.

* However, we can see that there does appear to be "something" going on.
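
* One possible way to get a cleaner picture, sketched here with stock Stata commands: fit a separate local polynomial on each side of the cut off and overlay the two fits on the raw data.

* The marker styling and the default lpoly bandwidth are arbitrary choices, not anything tuned for this data.

twoway (scatter perf2 income, msize(tiny) mcolor(gs12)) ///
       (lpoly perf2 income if grant==1) ///
       (lpoly perf2 income if grant==0), ///
       xline(.5) legend(off) title(Local Polynomial Fits by Grant Status)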

reg perf2 income grant
* We can see that our initial estimate of the effect of the grant is entirely wrong.

* It appears so far that the grant is actually hindering performance (which we know is false).

* Now, let's try our new command rdrobust

rdrobust perf2 income_center
* The default cut point is at 0.  Thus using income_center works.

* Though this estimate is negative and thus seems the reverse of what we would expect, it is actually working quite well.

* That is because regression discontinuity estimates the jump in the outcome variable at the discontinuity, with the default assumption that the treatment switches from 0 to 1 as the forcing variable crosses the cut off from below.

* In this case, however, crossing the cut off from below switches the grant from 1 to 0.

* Thus we must reverse the sign on the RD estimator in order to identify the true effect in this case.

* Alternatively, we could switch the sign of income.

gen nincome_center = income_center*(-1)

rdrobust perf2 nincome_center
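
* As a rough cross-check on the sign, here is a minimal sketch of a local linear regression done by hand, using only observations within .1 of the cut off.

* The .1 window is an arbitrary assumption and the kernel is uniform, so this is not the bandwidth selection or bias correction that rdrobust performs.

reg perf2 i.grant##c.income_center if abs(income_center) < .1

* The coefficient on 1.grant is the estimated jump at the cut off for students who receive the grant, so it can be compared directly with the true effect of .5 built into perf2.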

* rdrobust is a newly designed command that has some extra bells and whistles that other regression discontinuity commands do not have, as well as some oddities.

* I would suggest also looking at the more established user-written Stata command rd (ssc install rd)
rd perf2 nincome_center

* This command is nice because it estimates many bandwidths through the mbw option.

* The default mbw is "100 50 200", which means: use 100% of the MSE (mean squared error) minimizing bandwidth, then half of it, then twice it.

* We can plot our estimates of the treatment effect using a range of bandwidths.

gen effect_est = .
  label var effect_est "Estimated Effect"

gen band_scale = .
  label var band_scale "Bandwidth as a Scale Factor of Bandwidth that Minimizes MSE"

forv i = 1/16 {
  rd perf2 nincome_center, mbw(100 `=`i'*25')
    * At 100% the estimate is stored as lwald; otherwise as lwald followed by the percentage.
    if `i' ~= 4 replace effect_est = _b[lwald`=`i'*25'] if _n==`i'
    if `i' == 4 replace effect_est = _b[lwald] if _n==`i'
    replace band_scale = `=`i'*25'     if _n==`i'
}

gen true_effect = .5
  label var true_effect "True effect"

two (scatter effect_est band_scale) (line true_effect band_scale)

* We can see that around the 100% MSE bandwidth the estimates are fairly steady, though they dip a tiny bit.
