Wednesday, February 27, 2013

Endogenous Binary Regressors

* Often times we are interested in estimating the effect of a binary endogenous regressor on a binary outcome variable.

* It is not obvious how to simulate data that will fit the criteria specifications that we desire.

* First let's think about the standard IV setup.

* y = b0 + xb1 + wb2 + u
* w = g0 + zg1 + v

* With u and v distributed normally and independently of x, w, and z.

* We know this setup is generally not correct if either y is binary or w is binary .

* When y is binary we might choose to use a MLE probit estimator with the following assumptions:

* P(y=1) = normal(b0 + xb1 + wb2)
* P(w=1) = normal(g0 + zg1)

* However, it is not easy to generate data in this form.

* Instead we will generate data introducing endogeneity by use of an unobserved addative error.

set obs 1000

gen z = rnormal()
  label var z "Exogenous instrument"

gen v = rnormal()
  label var v "Unobserved error"

gen wp = normal(-.5 + z*1.25 + 1*v)
  label var wp "Endogenous variable"

gen w = rbinomial(1, wp)

* The above equation should not be expected to estimate a coefficient of 1.25 on the z variable.
* This is because of the addative error term v which contributes to the implicit error of the normal CDF to create an error with a total variance of 1 + 1 = 2
* Thus, when the probit estimator is run it automatically scales the equation to be unit.
* We can discover consistent estimator by rescaling the coefficient on z to the true standard deviation.
di -.5/(2^.5)  " for the constant"
di 1.25/(2^.5) " for the coefficient on z"

gen x1 = rnormal()
  label var x1 "Exogenous variable"

probit w z
  * Pretty close estimates to what we expect.

gen yp = normal(1 + w*1.5 + x1 - (2^.5)*v)
  * Now we are including the v term in the generation of y in order to introduce an endogenous correlation between w and y.
  * We need to adjust our estimated coefficients.
  di "Variance of the unobserved =" 1^2 + 2^.5^2
  di "Standard deviation =" (1^2 + 2^.5^2)^.5
  di "Constant coefficient=" 1/(1^2 + 2^.5^2)^.5
  di "x coefficient=" 1/(1^2 + 2^.5^2)^.5
  local w_est =1.5/(1^2 + 2^.5^2)^.5
  di "w coefficient=" `w_est'

gen y = rbinomial(1, yp)

* First let's see what happens if we neglect to make any effort at controlling for the endogeneity.
probit y w x1

test w= `w_est'

* Our estimate of the coefficient on w is way too small

* In order to estimate our relationship in a consistent manner we will use the biprobit command.
* This command effectively estimates two separate probit regressions with the allowance that the unobserved outcomes be correlated with the parameter /athrho.
* In this case, the unobserved component from the w estimation equation is the endogenous component from the y estimation equation.
* Since, w only is entering y linearly it is sufficient that the unobserved portion of w correlated with y is in a sense controlled for through the joint probit regression.
biprobit (y = w x1) (w=z x1)
test w= `w_est'

* We can test the "endogeneity" of w by testing the significance of /athrho.  Which appears in this case to be quite significant.

* Not bad, we fail to reject there being any difference between our estimate of the coefficient on w and the true.

* This is working well though with 1,000 observations.  It seemed to be extremely ineffective with 100 observations however.

* What do we do if there are multiple endogenous binary variables?

* Let's generate similar data:
set obs 1000

gen z1 = rnormal()
gen z2 = rnormal()

gen v1 = rnormal()
gen v2 = rnormal()

gen x1 = rnormal()

gen wp1 = normal(-.5 + z1*.5 - z2*.2 + .5^.5*v1)
gen w1 = rbinomial(1, wp1)
  label var w1 "Endogenous variable 1"

gen wp2 = normal(.75 - z1*.5 + z2*.2 + .5^.5*v2)
gen w2 = rbinomial(1, wp2)
  label var w2 "Endogenous variable 2"

gen yp = normal(.1 + w1*.7 + w2*1 + .5*x1 - .5^.5*v1 + .5^.5*v2)
gen y = rbinomial(1, yp)

* Once again we must adjust our expectation of the coefficients
local var_unob = 1+.5^.5^2+.5^.5^2
di "Variance of unobservables =" `var_unob'
  local est_b0 = 1/(`var_unob')^.5
di "Constant coefficient =" `est_b0'
  local est_w1 = 1/(`var_unob')^.5
di "w1 coefficient coefficient =" `est_w1'
  local est_w2 = .7/(`var_unob')^.5
di "w2 coefficient coefficient =" `est_w2'
  local est_x1 = .5/(`var_unob')^.5
di "x1 coefficient coefficient =" `est_x1'

probit y w1 w2 x1
test [y]w1=`est_w1'
test [y]w2=`est_w2'
test [y]_cons=`est_b0'
test [y]x1=`est_x1'

* The majority of estimates are not working well.

* Let's try doing the joint MLE

* Previously the biprobit was sufficient.  However, biprobit only allow for a two-way probit.

* Fortunately, there is a user written command that uses simulation to approximate a multivariate probit.

* install mvprobit if not yet installed.
* ssc install mvprobit

* The syntax of mvprobit is very similar to that of biprobit

mvprobit (y = w1 w2 x1) (w1=z1 z2 x1) (w2=z1 z2 x1)
test [y]w1=`est_w1'
test [y]w2=`est_w2'
test [y]_cons=`est_b0'
test [y]x1=`est_x1'

* Unfortunately the estimates are still too far away from the true.

* However, they are closer.

* Let's see how the fitted probability line compares with the true.
predict yp_hat
replace yp_hat = normal(yp_hat)

reg yp_hat yp, nocon
predict yp_hat1_hat

two   (scatter yp yp_hat) (line yp yp_hat1_hat if yp_hat1_hat<1 p="">      yscale( range(0 1 ) ) xlabel(0 1) legend(off) title(Predicted probability against true)

* Overall, not a bad fit.

* This might not be sufficient for many applications.

* What happens when one of the endogenous variables is continuous?

Monday, February 25, 2013

New Project - Stata Blog Aggregator

I have decided to start a Stata blog aggregator.  I believe I am the first.  You can find my preliminary efforts at

The purpose of a blog aggregator is to provide a system by which users are connected with various perspectives and expertise from many bloggers.  If you have favorite Stata bloggers who you think would appreciate the additional publicity from such an aggregator, please suggest that they contact me.

For the blogger, there is very little they have to do.  All I need is a feed from their blog.  The aggregation software will do the rest.

Thursday, February 21, 2013

Poll Results

I received 31 poll responses.

Thanks everybody who responded to the poll!


Wednesday, February 20, 2013

Interpreting the Control Function Coefficient

* Is the control function coefficient a measure of the direction and size of the bias caused by endogeneity?

* Imagine the endogenous variable w being composed of three components: 1 endogenous portion, 2 exogenous portion correlated with z, 3 exogenous portion uncorrelated with z.

set obs 10000

gen z = rnormal()
gen v = rnormal()

gen endogenous = rnormal()
gen exog_with_z = z
gen exog_without_z = rnormal()

gen w = endogenous + exog_with_z + exog_without_z

* Likewise we can think of the error u as composed of both an exogenous portion and an endogenous portion (correlated with part of w)

gen u = endogenous + rnormal()*3

gen y = 1*w + 3*u

reg y w
* We can see that OLS is clearly upward biased

ivreg y (w=z)
* Instrumental variables seems to be working well

* Now for the control function
reg w z

predict v_hat, resid

reg y w v_hat
* I was thinking that the control function coefficient could generally not be interpretted directly the sign of the bias but looking at this simulation it appears I was wrong.

* I will have to do some more thinking on this.

Monday, February 18, 2013

Poll to help improve content

Please consider filling out this very short poll to help improve my content:


People are already responding!

Thanks so much!  I hugely appreciate my readers!

And I appreciate any and all feedback.

After the poll has been running for awhile, I will publicize the results.

Saturday, February 16, 2013

Regression with Endogenous Explanatory Variables

* Imagine you would like to estimate the agricultural production process.

* You have two explanatory variables.  Rain and use of Hybrid or traditional seeds.

* You are concerned that better off (in terms of SES) framers will be more likely to use Hybrid seeds.

* So should you include hybrid as an explanatory variable?

* It explains much of your variation quite well.

* The question boils down to a single question.

* Is the use of hybrid seeds correlated with rainfall?

* This only makes sense of course if people decide to use hybrid seeds if they have some knowledge of season rain in their area and they make the purchase of hybrid seeds based that or if they make the decision of if and when to purchase hybrid seeds based on what they have observed this season so far (and there is some time dependency of rainfall).

* Our model is y = rainfall*br + hybrid*bh + u0

* First let's image the case when hrybid seed choice and rainfall are correlated.

* In that case we can no longer really call even rainfall exogenous.

* That is, rainfall is only exogenous conditional upon controlling for seed choice.

* ie. cov(rainfall, u| hybrid) = 0 because we know cov(rainfall,hrybid) ~= 0

* Why, because excluding hybrid from our model gives us:

* y = rainfall*br + u1    ... where u1 = hyrbid*bh + u0

* So cov(rainfall, u1) = cov(rainfall, u| hybrid) + cov(rainfall, hybrid*bh| hybrid) ~= 0

* This implies that we should include hybrid.

* This is true whenever we have two correlated explanatory variables.  Leaving one out will cause the other explanatory variable to be attributed a portion of the effect of the alternative variable.

* The bias is probably equal to: (x1 and x2 are explanatory variables, both exogenous (when both included) with x2 being correlated with x1.

* The true model: y = x1B1 + x2B2 + u

* Call the linear projection of x1 into x2: L(x2|x1)=A*x1

* Now estimating just y = x1B1 gives us y = x1B1 + L(x2|x1)*B2 + uhat  ... substituting the linear project of x1 on x2 into x2.

* Which reduces to y = x1B1 + A*x1*B2 = (B1 + A*B2)*x1

* The bias is therefore A*B2

* Let's see this in action

* Two "exogenous" variables

set obs 10000

gen x1 = rnormal()

gen x2 = rnormal() + .5*x1    // A = .5
  * Formulating x2 this way is not important for the above result.
  * Any random draw of x1 and x2 that has them covarying together will produce the same results.
  * Only this way it is easy to see that A=.5

gen u = rnormal()*5

gen y = x1*2 + x2*(-3) + u    // B1 = 2,  B2 = -3,  A = .5

* No problem here
reg y x1 x2

* But we will try to not include x2

reg y x1

* Expected coefficient estimate = B1 + A*B2 =  2 + .5*(-3) = .5

* So pretty close.  In this case the bias is equal to A*B2 = -1.5

* One "exogenous" and one "endogenous" variable

* However, we are concerned that hyrbid (x2) is correlated with our error.

* Should we include it?

* The answer is a little subtle.

* If we include it then we will introduce error

set obs 10000

gen rain = rnormal()
  label var rain "Normalized rain data"

gen ses = rnormal()
  label var ses "Unobserved Socio Economic Status"

gen hybrid = rbinomial(1, normal(rain + ses))
  label var hybrid "Wether a person uses hybrid seeds"

gen u = rnormal()*5

gen y = rain + 3*hybrid + ses + u

* No problem here
reg y rain hybrid ses

* However, we do not have data on ses so:
reg y rain hybrid

* Suddenly big problems.

* But we can still get an unbiased estimate of the effect of rainfall on production in the following manner.
reg y rain

* Rainfall uncorrelated with hybrid and hybrid endogenous

* However, let's see what happens if hybrid choice is endogenous but uncorrelated with rainfall.

set obs 10000

gen rain = rnormal()
  label var rain "Normalized rain data"

gen ses = rnormal()
  label var ses "Unobserved Socio Economic Status"

gen hybrid = rbinomial(1, normal(ses))
  label var hybrid "Wether a person uses hybrid seeds"

gen u = rnormal()*5

gen y = rain + 3*hybrid + ses + u

reg y rain hybrid
  * The estimate on rainfall is good

reg y rain
  * The estimate on rainfall is not as accurate.  Why?

  * Because less of the unobserved variation is explained.

* Therefore I suggest taking the following steps:

* 1. Think through if any of your variables are endogenous.
* 2. If they are endogenous, estimate their correlation with the explanatory explanatory variables.
* 3. If there is no significant correlation then include them in the regression.  This will give you more precise estimates of the coefficients on the exogenous variables.  Do not interpret the coefficients on the endogenous variables.  Also, do not interpret the F-test for rejection of the model since you know that your endogenous variables are picking up part of the unobserved error.  If you instead want to test model fit, test the joint significance of all of your exogenous variables together with a Wald test.
* 4. If there is a correlation, then it is not clear what to do.  If you exclude endogenous variables correlated with the exogenous variables then this will introduce endogeneity in the the "exogenous variables".
* 5. However, if you introduce the endogenous variable as a control in the equation then you will introduce a level of bias and inconsistency.  I don't think there are any clear rules asking what to do in these circumstances.

Tuesday, February 12, 2013

Fractional Response Estimation

* Often times, many questions that we would like to answer take the form of fractional response data.

* What percentage of students by school meet international standards, what percentage of losing military commanders are brought up on war crimes charges, or what percentage of those in different cities have of getting the flu.

* Using standard OLS methods without accounting for the natural upper and lower bound may lead to biased and inconsistent estimates (if there are build ups around the upper or lower bounds).

* This is due to corr(u,y)!=0 by the same reasoning as with binary response, censored response, and truncated response data.

* For example:

* Simulation 1

set obs 1000

gen x=rnormal()
gen u=rnormal()
* Unobserved heterogeniety

gen y=normal(0+6*x+u)
  label var y "Observed outcome"
* Such a large coefficient on x will drive most of the observations into the extremes.

hist y, frac title(We can see large buildups at 0 and 1)

reg y x
  local B_OLS = _b[x]

* Let's see our predicted outcomes
predict y_hat
  label var y_hat "Predicted from OLS"

* One of the problems we realize immediately is that it is hard to discuss bias or consistency when we don't really know that the true expected change in y is as a result of a change in x.

* Fortunately, this is easy to calculate.

* y=NormalCDF(B0 + xB + u)
* dy/dx = B * NormalPDF(B0 + xB + u)    ... by chain rule
* dy/dx = 2 * NormalPDF(-2 + x*2 + u)

gen partial_x_y = 6 * normalden(0+x*6 + u)
* Of course each effect is unique to each observation currently.

* To find the average we just average.

sum partial_x_y
  * The mean of the partial_x_y shows us that the OLS estimator is too large.
  local B = r(mean)

* To find the "true" linear projection of x on y we need first to identify the B0 coefficient.

sum y
  local y_bar = r(mean)

sum x
  local x_bar = r(mean)

local B0 = `y_bar'-(`x_bar'*`B')

gen true_XB = `B0' + `B'*x
  label var true_XB "Predicted from 'true' LM"

gen linear_u = y-true_XB

corr linear_u x
* We can see that the linear error terms are correlated with x, which is potentially what we expect if x has a positive coefficient.

* This is because as x gets large (approaches infitity for example) then the max(linear_u) must approach 0 since x is driving y to be 1 and u cannot make y larger than 1.

* The argument is identical for as x gets small.  Thus when x is large E(u)<0 e="" is="" small="" u="" when="" while="" x="">0.

* We can see our OLS estimator is about half that of the true (though when the data was more evenly dispursed it seemed the OLS estimator was working well.)

* A greater concern for many might be not only the bias of the estimates but the feasibility of the predictions.

two (scatter y x, sort) (scatter y_hat x, sort) (line true_XB x, sort)

* We can see that the true Average Partial Effect is steeper, being less vulnerable to being adjusted by the outlierers created by Y being bound.

* This is not to say that the true APE is ignoring that for a large portion of the observations changes in x has very little effect on y.

* Only that OLS assumes a particular relationship in the errors which is violated.

* Alternatively we can use a estimator that better fits our data generating process.

glm y x, family(binomial) link(probit) robust

predict y_glm
  label var y_glm "Predicted from GLM"

two (scatter y_glm y) (lfit  y_glm y), title("A perfect fit would be a line that goes from 0,0 to 1,1")

* This looks pretty good.

two (scatter y x, sort) (scatter y_hat x, sort) (line y_glm x, sort col(red))
* The fitted data is much more satisfying as well.

* Notice though, that even if you ignored the fractional responses and just looked at the binary responses, a probit model would probably fit pretty well.

* We should be able to estimate the marginal effects with the margins command
margins, dydx(x)
  di "True APE = `B'"
  di "OLS APE = `B_OLS'"

* Unfortunately, it does not appear that the glm command is doing worse than the OLS estimator at estimating the APEs.

* This is probably because of the model misspecification.  That is, when generating the data, the error term should not be a separate term within the normalCDF.

* However, I am not sure how best to do this otherwise.  If anybody has any ideas I am open to them.

* An alternative might be the following:

* Simulation 2

set obs 1000

gen x=rnormal()

gen yp=normal(3*x+2)

gen ob = rpoisson(100)
  label var ob "Number of individuals"

gen y=rbinomial(ob,yp)/ob
  label var y "Observed outcome"

* This formulation fits better into our concept of how fractional response data works.

* That is there is some number of individual units (in this case 100) which each are being evaluated for either 0 or 1 outcomes.

* By dividing by 100 we reduce the data to a fractional response problem which could be used to evaluate different sized groups since the fractional reponse maintains the same scale.

hist y, frac title(Fractional Responses Look Very Similar)

* Perhaps a little less build up on the tails.

reg y x
  local B_OLS=_b[x]

* Let's see our predicted outcomes
predict y_hat
  label var y_hat "Predicted from OLS"

* Now to esimate the partial effect we use the same formulation as before because
* for the binomial E(y|n,p)=np
* But we transformed y=E(y/n|n,p)=np/n=p
* Thus:
* y=NormalCDF(xB)              ... but now we do not have to worry about the error u.
* dy/dx = B * NormalPDF(xB)    ... by chain rule
* dy/dx = 2 * NormalPDF(x*2)

gen partial_x_y = 3 * normalden(x*3+2)

* To find the average we just average.

sum partial_x_y
  local B = r(mean)
* We can see that the OLS regressor now is too large.

sum y
  local y_bar = r(mean)

sum x
  local x_bar = r(mean)

local B0 = `y_bar'-(`x_bar'*`B')

gen true_XB = `B0' + `B'*x
  label var true_XB "Predicted from 'true' LM"

gen linear_u = y-true_XB

corr linear_u x

two (scatter y x, sort) (line y_hat x, sort) (line true_XB x, sort)

* The only problem with generating data this way is that our data looks too perfect.
* How do we add greater variability into such a formulation?
* A method many might think of would be to add some kind of fixed effect per observation.
* This would suggest that the individuals within each observation are more or less likely to participate based on some unobserved factor which is unique to that observation.
* This is probably not a stretch to think about: People in a city are more likely to get the flu if the flue is already common in the city, Student are more likely to pass an exam if the school is particularly strong or attracts strong students, etc.
* However, if this is the case then the previous data generating process we discussed is more appropriate since u is in effect an unobserved observation level effect.
* In the extreme, as the number of individuals gets large the binomial formulation converges on having the response be the pdf (by the law of large numbers).

glm y x, family(binomial) link(probit) robust
di "Notice the standard errors are about "  e(dispers_p)^.5 " fraction of the those if there were a single draw of the binomail"
* Which is appropriate since there are one average 100 draws, thus the estimated standard error should be about 100 times too large

predict y_glm

margins, dydx(x)
  di "True APE = `B'"
  di "OLS APE = `B_OLS'"
* We can see that now the glm estimator is working better.

* A researcher might be instead tempted to estimate total raw responses as a function of x via OLS.

* In order to do this the researcher would not calculate the fraction of responses.

gen yraw = y*ob
  hist yraw

* The data does not look like it would be crazy to estimate some kind of truncated regression (tobit for example).

* However, given how we know the data has been generated, this would be a mistake.

* The zeros in the data are not truncated values but rather meant to be zeros.

tobit yraw x, ll(0)
* We can adjust the results to our previous scale by dividing the coefficient by 100.

di "Our censored regression estimate is: " _b[x]/100

* This is larger than our previous estimate and biased because it is making the assumption that if our data were not bottom coded at 0 then we would have actually observed negative values for y.

* This is clearly impossible.

* We can further aggravate our mistake by specifying an upper limit as well.

* However, we will make this mistake looking at the original ratios:
tobit y x, ll(0) ul(1)
  predict y_tobit

* This yeilds a large and even more biased estimate.

two (scatter y x) (line y_tobit x)

* Graphically it is easy to see how strong the assumptions would need be to justify use of a censored tobit.

* It is similar to what would have been estimated if we just dropped all values that were equal to 0 or 1.

reg y x if y>0 & y<1 p="">

Monday, February 11, 2013

Non-Parametric Regression Discontinuity

* I recently went to an interesting seminar today by Matias Cattaneo from the University of Michigan.

* He was presenting some of his work on non-parametric regression discontinuity design which I found interesting.

* What he was working on and the conclusions of the paper was interesting but even more interesting was a release by him and coauthors of a Stata package that implements RD design for easy

* net install rdrobust, from( replace

* Regression discontinuity is a technique which allows identification of a localized effect around a natural or structured policy discontinuity.

* For instance, if you wondering what the effect federal grants have on college attendance, then you may be concerned that just looking at those students who are eligible for federal grants in contrast with those who are not eligible will be problematic because students who are eligible (low income) may different than those who are not eligible for the grant (not low income).

* The RD argument is that if individuals do not, as a response to the grant being available, move their reported income level to become eligible for the grant than those who are near the cut off for the grant and those not near the cut off will be fundamentally very similar.

* This may occur if for instance the income cut-off for the grant is unknown.

* So even if students are systematically under-reporting their income, they are not doing it aware of the actual cut off, so the students sufficiently close, above and below the cut off are arguably the "same" or drawn from the same pool except that one group received the program and another group did not.

* The previous post deals some with assuming a linear structure of the underlying characteristics.


* However, the more interesting case (potentially) may be when we assume a nonlinear response to the income in our dependent variable.

* But before going there let's think about what this method boils down to.

* Like all identification methods in statistics or econometrics when we do not have experimental data, identification of an effect is driven by some exogeneity argument.

* That is, x causes y and is unrelated to u (the error).  In the case when u may be correlated with the error the use an exogenous variable to force the movement in the variable of interest may be sufficient to identify a causal effect.

* In this case, clearly it is not enough to simply see what the average y response (GPA, attendance, graduation rates, whatever) is to a change in grant level because those who receive the grants are systematically different from those who do not.

* However, because the cut off for receiving the grant is unknown, around the cut off the two samples who receive the grant and who do not can arguably be considered the same.

* Thus, we could say that the unknown position of the cut off is the random exogenous variable which near the cut off forces some students into the group that receives the grant and some students into the group that does not.

* Let's imagine some non-parametric relationship between income and performance:


set obs 10000

gen income = 3^((runiform()-.75)*4)
  label var income "Reported Income"

  sum income
gen perf0 = ln(income) + sin((income-r(min))/r(max)*4*_pi)/3 + 3
  label var perf0 "Performance Index - Base"

scatter perf0 income

* Looks pretty non-parametric

* Let's add in some random noise
gen perf1 = perf0 + rnormal()*.5
  label var perf1 "Performance Index - with noise"

scatter  perf1 income

* Using the user written command rcspline, we can see the local average performance as a function of income.

* ssc install rcspline

rcspline perf1 income,  nknots(7) showknots title(Cubic Spline)
* I specify "7" knots which are the maximum allowed in the rcspline command.

* The spline seems to fit the generated data well.

* Now let's add a discontinuity at .5.

gen grant = income&lt;.5
sum grant

* So about 50% of our sample is eligible for the grant.

* Now let's add the grant effect.

* First let's generate an income variable that is centered at the grant cut point.
gen income_center = income-.5

gen perf2 = perf1 + .5*grant - .1*income_center*grant
  * Thus the grant is more effective for students with lower income.
  label var perf2 "Observed Performance"

**** Simulation done: Estimation Start ****

rcspline perf2 income,  knots(.15 .25 .35 .37 .4 .45 .5 .55 .6 .65 .75 .85 1.1 1.25 1.5) title(Cubic Spline)
* This is obviously not the ideal plot and I have had some difficulty finding a command which will generate the plot that I would like.

* However, we can see that there does appear to be "something" going on.

reg perf2 income grant
* We can see that our itial estimate of the effect of the grant is entirely wrong.

* It appears so far that the effect of the grant on performance is actually hindering performance (which we know is false).

* Now, let's try our new command rdrobust

rdrobust perf2 income_center
* The default cut point is at 0.  Thus using income_centered works.

* Though this estimate is negative and thus seems the reverse of what we would expect, it is actually working quite well.

* That is because regression discontinuity is trying to identify the effect of the discontinuity on the outcome variable with the default assumption that at the discontinuity the forcing variable is becoming 1.

* In this case however, the discontinuity is really driving the grant to be equal to zero.

* Thus we must inverse the sign on the rd estimator in order to identify the true effect in this case.

* Alternatively, we could switch the sign of income.

gen nincome_center = income_center*(-1)

rdrobust perf2 nincome_center

* rdrobust is a newly designed command that has some extra bells and whistles that other regression discontinuity commands have as well as some oddities.

* I would suggest also looking to the more official stata command rd (ssc install rd)
rd perf2 nincome_center

* This command is nice because it estimates many bandwidths through the mbw option.

* The default mbw is "100 50 200" which means, use the 100 MSE (mean squared error) minimizing bandwidth, half of it and twice it.

* We can plot our estimates of the treatment effect using a range of bandwidths.

gen effect_est = .
  label var effect_est "Estimated Effect"

gen band_scale = .
  label var band_scale "Bandwidth as a Scale Factor of Bandwidth that Minimizes MSE"

forv i = 1/16 {
  rd perf2 nincome_center, mbw(100 `=`i'*25')
    if `i' ~= 4 replace effect_est = _b[lwald`=`i'*25'] if _n==`i'
    if `i' == 4 replace effect_est = _b[lwald] if _n==`i'
    replace band_scale = `=`i'*25'     if _n==`i'  
gen true_effect = .5
  label var true_effect "True effect"

two (scatter effect_est band_scale) (line true_effect band_scale)

* We can see around the 100% MSE bandwidth estimates are fairly steady though they dip a tiny bit.

Thursday, February 7, 2013

2SLS with multiple endogenous variables

* I am wondering if when using 2SLS you must use a multivariate OLS in the reduced form or if you can just do each individual endogenous variable.

* Let's see!

set obs 10000

* First generate the instruments
gen z1 = rnormal()
gen z2 = rnormal()

* Now the error.  u1 and u2 are the parts of w1 and w2 correlated with the error u.
gen u1 = rnormal()
gen u2 = rnormal()
gen u3 = rnormal()*3

gen w1 = 1*z1 + .5*z2 + rnormal() + 2*u1
gen w2 = -.5*z1 + 1.5*z2 + rnormal() - 2*u2 + .2*u1

gen u = u1 + u2 + u3
gen y = 4 + w1*1 + w2*1 + u

* We can see because u is correlated with w1 and w2, OLS will be biased and inconsistent.

reg y w1 w2

* Instrumental variable regression could solve this problem

ivreg y (w1 w2 = z1 z2)

* Let's see about our 2SLS (which should be the same as IV)

reg w1 z1 z2
predict w1_hat

reg w2 z1 z2
predict w2_hat

reg y w1_hat w2_hat
* It seems that 2SLS using separate regressors is producing the same results.

* This is probably because in SOLS if you use the same regressors then I think the coefficients are the same but the standard errors may be adjusted for cross equation correlations (I think I recall).

Wednesday, February 6, 2013

Non-Parametric PDF Fit Test

* This is an idea that I decided to explore before inspecting how others have addressed the problem.

* As noted by my previous post we cannot use standard independence based draw reasoning in order to test model fit.

* The following command will simulate random draws in the distribution being tested and see how closely they fit to exact pdf draws.

* It will then use that information to see if we can reject the null that the observed distribution is the same as the null distribution.

cap: program drop dist_check
program define dist_check, rclass

  * The syntax command parses the input into useful macros used for later calculations.
  syntax  varlist, Tgenerate(numlist >= 100) [Dist(string)]
  if "`dist'"=="" local dist="normal"
  if "`dist'"=="normal" di "Testing if `varlist' is normally distributed"
  if "`dist'"=="uniform" di "Testing if `varlist' is uniformly distributed"
  if "`dist'"=="poisson" di "Testing if `varlist' has a poisson distributed"

  di "Simulate `tgenerate' draws"
  * Construct a t-vector in Mata to store estimation results.
  mata: tdist = J(`tgenerate', 1, .)
  * This will randomly draw from the distribution being tested a number of times equal to tgenerate.
  qui: drop if `varlist' == .
  forv i=1(1)`tgenerate' {
    tempvar x zx z p d

    if "`dist'"=="normal" qui: gen `x' = rnormal()

if "`dist'"=="poisson" {
      qui: sum `varlist'
      qui: gen `x' = rpoisson(r(mean))

if "`dist'"=="uniform" qui: gen `x' = runiform()

    sort `x'
    qui: gen `p' = (_n-.5)/_N

if "`dist'"=="normal" {
      qui: egen `zx' = std(`x')
      qui: gen `z' = invnormal(`p')
      qui: gen `d' = (`z'-`zx')^2

if "`dist'"=="poisson" {
      qui: sum `x'
      qui: gen `z' = round(invpoisson(r(mean), 1-`p'))
 * The invpoisson distribution in Stata is misspecified.  1-p is neccessary.
      qui: gen `d' = (`z'-`x')^2

if "`dist'"=="uniform" {
      qui: gen `z' = _n/_N
      qui: gen `d' = (`z'-`x')^2
qui: reg `d'
    local t = _b[_cons]/_se[_cons]
    mata: tdist[`i', 1]=`t'

drop `x' `z' `p' `d'
cap drop `zx'
  * From the above loop we should have a vector of t values.
  * We can use that vector to construct a Confidence Interval used observed when true t values as cutoff points.
  mata: tsorted = sort(tdist, 1)

  * One tailed test
  local c90 = floor(`tgenerate'*.90)+1
  mata: st_local("t90", strofreal(tsorted[`c90']))
  local c95 = floor(`tgenerate'*.95)+1
  mata: st_local("t95", strofreal(tsorted[`c95']))

  local c99 = floor(`tgenerate'*.99)+1
  mata: st_local("t99", strofreal(tsorted[`c99']))
  * Two Tailed
    * 90% CI
  local c90U = floor((`tgenerate'+.5)*.95)+1
  mata: st_local("t90U", strofreal(tsorted[`c90U']))
  local c90L = ceil((`tgenerate'+.5)*.05)-1
  mata: st_local("t90L", strofreal(tsorted[`c90L']))
    * 95% CI
  local c95U = floor((`tgenerate'+.5)*.975)+1
  mata: st_local("t95U", strofreal(tsorted[`c95U']))

  local c95L = ceil((`tgenerate'+.5)*.025)-1
  mata: st_local("t95L", strofreal(tsorted[`c95L']))

    * 99% CI
  local c99U = floor((`tgenerate'+.5)*.99)+1
  mata: st_local("t99U", strofreal(tsorted[`c99U']))

  local c99L = ceil((`tgenerate'+.5)*.01)-1
  mata: st_local("t99L", strofreal(tsorted[`c99L']))

  ** Now we do the estimation
    tempvar x zx z p d

qui: gen `x' = `varlist'

    sort `x'
    qui: gen `p' = (_n-.5)/_N

  * We transform the data in different ways depending upon what distribution we are assuming.
if "`dist'"=="normal" {
      qui: egen `zx' = std(`x')
      qui: gen `z' = invnormal(`p')
      qui: gen `d' = (`z'-`zx')^2

if "`dist'"=="poisson" {
      qui: sum `x'
      qui: gen `z' = round(invpoisson(r(mean), 1-`p'))
      qui: gen `d' = (`z'-`x')^2

if "`dist'"=="uniform" {
      qui: sum `x'
      qui: gen `z' = (`x'-r(min))/(r(max)-r(min))
      qui: gen `d' = (`z'-`x')^2

* This is the regression of interest.
    qui: reg `d'
    local t = _b[_cons]/_se[_cons]

* Now we compare our t with the ts that were drawn when random variables were drawn from our distribution.
di _newline "Estimated t:  `: di %9.3f `t''" _newline

di "One-way Analysis"
di "  CI   (1%) :    0.000   to `: di %9.3f `t99''"
di "  CI   (5%) :    0.000   to `: di %9.3f `t95''"
di "  CI   (10%):    0.000   to `: di %9.3f `t90''"

    mata: psearch = abs(tsorted:-`t')
mata: a = 1::`tgenerate'
mata: b = psearch :== min(psearch)
    mata: st_local("position", strofreal(sum(b:*a)))
local p1 = 1-`position'/`tgenerate'
local p2 = min(`position'/`tgenerate',1-`position'/`tgenerate')*2

di "One-sided  p:`: di %9.3f `p1''   (`position' out of `tgenerate')"

di _newline "Two-way Analysis"
di "  CI   (1%): `: di %9.3f `t99L''    to `: di %9.3f `t99U''"
di "  CI   (5%): `: di %9.3f `t95L''    to `: di %9.3f `t95U''"
di "  CI  (10%): `: di %9.3f `t90L''    to `: di %9.3f `t90U''"
di "Two-sided p: `: di %9.3f `p2''"

    return scalar p1 = `p1'
    return scalar p2 = `p2'


* Let's see how this works on a sample draw.
set obs 1000
gen x = rnormal()+1
dist_check x, t(100) dist(poisson)
* Interestingly, some data distributions seem to fit to fit "better" pdfs from different distributions.
* Thus the estimated t is smaller for some distributions.
* For this reason I have included a two sided confidence interval.

* The following small program will be used to construct a Monte Carlo simulation to see how well the dist_check command is working.
cap program drop checker
program define checker
  syntax [anything], obs(numlist > 0) rdraw(string) reps(numlist > 0) dist(string)
  set obs `obs'
  gen x = `rdraw'
  dist_check x, t(`reps') dist(`dist')

* Easily generates data
checker , obs(1000) rdraw(runiform()*30) reps(200) dist(normal)

* Now let's see it in action
simulate p1 = r(p1) p2 = r(p2) , reps(100): ///
  checker , obs(50) rdraw(rnormal()*30) reps(100) dist(normal)
* With the above simulation almost by definition we know that it is sized propertly but just as a reference.
* We should reject at the 10% level for both one sided and two sided confidence intervals.
gen r1_10 = p1<= .1
gen r2_10 = p2<= .1

* mean r1_10 and  mean r2_10 are rejection rates for the one-tailed and two-tailed test respectively.

* Becasue the distribution is truelly the null we are rejecting the null only 10% of the time at the 10% level.

* This looks great.

* Now let's see about the test's power to reject the null when the true distribution is not the null.

checker , obs(1000) rdraw(rnormal()) reps(100) dist(uniform)
  * This distribution is almost uniform
hist x

simulate p1 = r(p1) p2 = r(p2) , reps(100): ///
  checker , obs(1000) rdraw(rnormal()) reps(100) dist(uniform)
* With the above simulation almost by definition we know that it is sized propertly but just as a reference.
gen r1_10 = p1<= .1
gen r2_10 = p2<= .1


checker , obs(100) rdraw(rnormal()+runiform()*50) reps(100) dist(uniform)
  * This distribution is almost uniform
hist x

simulate p1 = r(p1) p2 = r(p2) , reps(100): ///
  checker , obs(100) rdraw(rnormal()+runiform()*50) reps(100) dist(uniform)
* With the above simulation almost by definition we know that it is sized propertly but just as a reference.
gen r1_10 = p1<= .1
gen r2_10 = p2<= .1


* I think there might be some slight errors.  I hope this post is useful though I would not recommend using this particular command as there are more effective and better established commands already developed for testing distribution fit assumptions.

Tuesday, February 5, 2013

Test of PDF Fit - Does not work

* I came up with this idea for testing if an observed distribution could be tested.

* The idea is that it standardized your data of interest and then tests if when the data is ordered if the difference between the empirical distribution of our data and your null distribution.

* The following program will sort the data and construct a pdf of our null (the normal in this case).
cap: program drop normal_check
program define normal_check
  * Preverse the input data
  * Sort from smallest to largest
  sort x

  * Standardize x
  egen xnew = std(x)

  * Create a CDF of what each obsrvations probability is under the null.
  gen p = (_n-.5)/_N

  * Create a pdf projection of what each bin's expected value is under the null.
  gen targ_x = invnormal(p)

  * Calculate the difference (which is what we will be testing.
  gen dif_norm = xnew-targ_x

  * Regress the difference on a constant
  reg dif_norm

  * If the option "graph" was specified then graph the hypathetical and the observed densities.
  if "`1'" == "graph" twoway kdensity xnew || kdensity x
  * Restore the data to how it was before initiating the program

* Now let's generate some sample data

set obs 100
gen x = rnormal()

normal_check graph
* We can see that our PDF fitness test has failed.

* Perhaps if we bootstrap the process?
bs: normal_check

* We reject the null (though we know it is true).

* Nope.  Why not?

* The sorting mechanism is a violation of the independence assumption.

* We need to generate alternative test statistics that will adjust for non-random sorting.

* My post tomorrow will demonstrate a response to this post which is generally effective at detecting underlying distributions.

* Distribution detection however is nothing new in statistics.

* Look up the ksmirnov command.  There is also Anderson-Darling and Chi-Squared options though I do not know what is coded in Stata or what their syntax looks like.

Monday, February 4, 2013

Potential IV Challenges even with RCTs

* You design an RCT with a single experimental binary treatment.

* But you think your instrument has the power to influence multiple endogenous response variables.

* We know we cannot use a single instrument to instrument for more than one variable jointly.

* Does it still work as a valid instrument if it affects other variables that can in turn affect our outcome?

set obs 1000

gen z = rbinomial(1,.5)

gen end1 = rnormal()
gen end2 = rnormal()
* end1 and end2 are the endogenous component of x1 and x2 which will bias our results if we are not careful.

gen x1 = rnormal()+z+end1
gen x2 = rnormal()+z+end2

gen u = rnormal()*3

gen y = x1 + x2 + end1 + end2 + u

reg y x1 x2
* Clearly the OLS estimator is biased.

ivreg y (x1=z)
ivreg y (x2=z)
* Trying to use the IV individually only makes things worse.

ivreg y x1 (x2=z)
* Trying to control for x1 does not help.

* So what do we do?

* If we have control over how the instrument is being constructed then we think things through carefully making sure that if we are interested on the effect of x1 on y then our instrument only results in a direct effect and no additional affect on x2.

* Unfortunately, this task can not always be feasible.

* Another consideration we should take into account is potential variable responses to the istrument.

* For instance:

set obs 1000

gen z = rbinomial(1,.5)

gen end1 = rnormal()
* end1 and end2 are the endogenous component of x1 and x2 which will bias our results if we are not careful.

gen g1 = rnormal()
gen grc = g1+1
  label var grc "Random Coefficient on the Instrument"

gen x1 = rnormal()+z*grc+(end1)

gen u = rnormal()*3

gen brc = g1+1
  label var brc "Random Coefficient on the Endogenous Variable"

gen y = x1*brc + end1 + u

reg y x1
* OLS is still biased.

ivreg y (x1=z)
* Unfortunatley, now too is IV.

* This is because, though our instrument is uncorrelated our errors, the response to the instrument is variable and that response may also be correlated with a variable response on the variable of interest.

* Thus, though corr(u,z)=0 and cor(x,z)!=0, we can still have problems.

* This particular problem is actually something that I am working on with Jeffrey Wooldridge and Yali Wang.

Saturday, February 2, 2013

Chamberlain Mundlak Device and the Cluster Robust Hausman Test

* Unobserved variation can be divided in a useful manner.

* y_it = X_it*B + c_i + u_it

* c_i is fixed individual effect or random individual effect if c_i is uncorrelated with Xi (the time averages of X_it).

* In order for c_it to bias B is if there is some correlation between X_it and c_i.

* Therefore if we were to create a new variable which "controls" any time constant variation in X_it then the remaining v_i must be uncorrelated with X_it.

* Thus the Chamberlain Mundlak Device was born.

* Let's see it in action!


set obs 100

gen id = _n

gen A1 = rnormal()
gen A2 = rnormal()

* Let's say we have 2 observations per individual
expand 2

gen x = rnormal()+A1
gen u = rnormal()*3

gen y = -5 +2*x + A1 + A2 + u

* Now in the above model there is both a portion of the unobserved variance correlated with the average x (A1) and a random portion uncorrelated with the average x (A2) the individual level.

* The fixed effect varies with the x variable while the random one does not.

* The standard approach in this case would be to use the Hausman test to differentiate between fixed effect and random effect models.

xtset id
* Let's first set id as the panel data identifier.

xtreg y x, fe
estimates store fe
* We store the estimates for use in the Hausman test

xtreg y x, re

hausman fe, sigmamore
* We strongly reject the null which we should expect so in classical econometric reasoning we choose to use the fixed effect estimator.

* An alternative method of estimating the fe estimator is by constructing the Chamberlain-Mundlak device.

* This device exploits the knowledge that the only portion of the time constant variation in X that can be correlated with u must be correlated only with the time average X for each individual.

bysort id: egen x_bar = mean(x)

reg y x x_bar

* Amazingly we can see that the new estimator is the same as the fe estimator above.

* Notice however the degrees of freedom.

* In the fe esimator we have used up half of our degrees of freedom.

* Yet our x estimate is the same size and our standard errors are very similar?

* If we double our sample size should not our standard errors decrease substantially?

* The answer is no. Why?

* I am going to run the fixed effect estimator manually.

reg y x

* Look at our SSE or the R2.  In the fixed effect model the R2 is much larger.

* This is because in terms of the random effect (A2), the fixed effect model controls for both the portion of the unobserved individual level variance which is correlated with the average x for each student as well as that portion uncorrelated with the average x.

* The Chamberlain-Mundlak (CM) device however only controls for the portion of the variance correlated with the average Xs.  Thus there is much more unexplained variance which ends up reducing the power of our test which is approximately accounted for when we adjust the sample size.

* The CM can be additionally useful because it provides an alternative form of the Hausman test.

reg y x x_bar

* The significance of the generated regressor x_bar indicates the exogeneity of the unobserved individual effects.

* The test can be easily adjusted to be robust to cluster effects by specifying cluster in the regression.

reg y x x_bar, cluster(id)

Friday, February 1, 2013

Macro Parsing

Do file

* In Stata when programming commands we often want to be able to be able to develop our own syntax for our own commands.

* The command "syntax" is often at providing a method for parsing command inputs into easily managed forms.

* However, sometimes we may want syntax options not permitted by the command "syntax".

* If we were to program our own instrumental variable regression that looked like Stata's these limitations might cause problems.

* myivreg y x1 x2 (w1 w2 w3 = z1 z2 z3)

* This is one method how this kind of parsing could be accomplished using the gettoken command.

* Gettoken parses or breaks a macro when a certain character or set of characters is input.

* Let's see it in action

local example1 abc def ghi,jkl
di "`example1'"

* We can see that the local example1 is a disorganized string

* We can seperate it into before and after the space with the following commands.

gettoken before after: example1

di "`before'"

di "`after'"

* The default parse character is " "

* We can specify an alternative parse character or set of characters using the parse option.

gettoken before after: example1, parse(",")

di "`before'"

di "`after'"

* Notice that the parse character is included in the after parse

* Let's see gettoken in: myivreg y x1 x2 (w1 w2 w3 = z1 z2 z3)

cap program drop myivreg
program define myivreg

  di "The local 0 contains the entire command line: `0'"

  gettoken main options: 0 , parse(",")

  gettoken dep_exog brackets: 0, parse("(")

  gettoken dep_var exog : dep_exog

  di "Dependent Variables: `dep_var'"
  di "Exogenous Variables: `exog'"

  gettoken cntr: brackets, parse(")")

  local cntr=subinstr("`cntr'", "(", "", .)

  di "Inside brackets: `cntr'"

  gettoken endog inst: cntr, parse("=")
  local inst=subinstr("`inst'", "=", "", .)

  di "Endogenous Variables: `endog'"
  di "Instrumental Variables: `inst'"

  * At this point we have all of the main pieces of elements we would need to run our ivreg.

  ivreg `dep_var' `exog' (`endog' = `inst')


* Let's generate some dummy data
set obs 100

* This will generate some strange data:
*  x2 -> z3 -> w0 -> z2 -> w1 -> z1 -> w2 -> x1 -> y
* Where -> means causes.  This is not a valid iv regression data
foreach v in  x2 z3 w0 z2 w1 z1 w2 x1 y {
  gen `v' = rnormal() `corr_vars'
  local corr_vars ="+`v'"

myivreg y (w0=z1)

myivreg y x1 x2 (w0 w1 w2 = z1 z2 z3)

* Everthing seems to be working well.
myivreg y x1 x2 (w0 w1 w2 = z1 z2 z3), test

* However, adding the option "test" does not cause a problem.  Is this a problem.  Not neccessarily.

* We just have not specified any code to ensure that unused syntax is accounted for.

* Combining the gettoken command with the syntax command can accomplish this.

cap program drop myivreg2
program define myivreg2

  di "The local 0 contains the entire command line: `0'"

  * This is a new bit of code that ensures that no options will be included
  syntax anything

  gettoken dep_exog brackets: anything, parse("(")

  gettoken dep_var exog : dep_exog

  di "Dependent Variables: `dep_var'"
  di "Exogenous Variables: `exog'"

  gettoken cntr: brackets, parse(")")

  local cntr=subinstr("`cntr'", "(", "", .)

  di "Inside brackets: `cntr'"

  gettoken endog inst: cntr, parse("=")
  local inst=subinstr("`inst'", "=", "", .)

  di "Endogenous Variables: `endog'"
  di "Instrumental Variables: `inst'"

  * At this point we have all of the main pieces of elements we would need to run our ivreg.

  ivreg `dep_var' `exog' (`endog' = `inst')


myivreg2 y x1 x2 (w0 w1 w2 = z1 z2 z3)

myivreg2 y x1 x2 (w0 w1 w2 = z1 z2 z3), test
* Now does not work.