## Wednesday, February 27, 2013

### Endogenous Binary Regressors

* Often we are interested in estimating the effect of a binary endogenous regressor on a binary outcome variable.

* It is not obvious how to simulate data that meets the specifications we desire.

* First let's think about the standard IV setup.

* y = b0 + xb1 + wb2 + u
* w = g0 + zg1 + v

* With u and v distributed normally and independently of x and z.  Endogeneity arises when u and v are correlated, which makes w correlated with u.

* We know this setup is generally not correct if either y or w is binary.
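* As a point of reference, here is a sketch of the fully linear case, in which standard 2SLS is consistent.  (The coefficient values below are arbitrary choices for illustration.)

clear
set obs 1000
gen z = rnormal()
gen v = rnormal()
* u and v are correlated, making w endogenous in the y equation
gen u = .5*v + rnormal()
gen x = rnormal()
gen w = 1 + .75*z + v
gen y = .5 + 1.5*x + 2*w + u
* Instrument w with z
ivregress 2sls y x (w = z)
* The estimates should be close to the true values .5, 1.5, and 2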

* When y is binary we might choose to use an MLE probit estimator with the following assumptions:

* P(y=1) = normal(b0 + xb1 + wb2)
* P(w=1) = normal(g0 + zg1)

* However, it is not easy to generate data in this form.

* Instead we will generate data that introduces endogeneity through an unobserved additive error.

clear
set obs 1000

gen z = rnormal()
label var z "Exogenous instrument"

gen v = rnormal()
label var v "Unobserved error"

gen wp = normal(-.5 + z*1.25 + 1*v)
label var wp "Probability that the endogenous variable equals 1"

gen w = rbinomial(1, wp)

* The above equation should not be expected to recover a coefficient of 1.25 on the z variable.
* This is because the additive error term v combines with the implicit error of the normal CDF to create a composite error with total variance 1 + 1 = 2.
* Thus, when the probit estimator is run it automatically rescales the equation so that the error has unit variance.
* We can recover consistent estimates by dividing the true coefficients by the true standard deviation of the composite error.
di -.5/(2^.5)  " for the constant"
di 1.25/(2^.5) " for the coefficient on z"

gen x1 = rnormal()
label var x1 "Exogenous variable"

probit w z
* Pretty close to the estimates we expect.

gen yp = normal(1 + w*1.5 + x1 - (2^.5)*v)
* Now we include the v term in the generation of y in order to introduce an endogenous correlation between w and y.
* We must again adjust our expected coefficients.
di "Variance of the unobserved = " 1^2 + (2^.5)^2
di "Standard deviation = " (1^2 + (2^.5)^2)^.5
di "Constant coefficient = " 1/(1^2 + (2^.5)^2)^.5
di "x coefficient = " 1/(1^2 + (2^.5)^2)^.5
local w_est = 1.5/(1^2 + (2^.5)^2)^.5
di "w coefficient = " `w_est'

gen y = rbinomial(1, yp)

* First let's see what happens if we make no effort to control for the endogeneity.
probit y w x1

test w= `w_est'

* Our estimate of the coefficient on w is way too small

* In order to estimate our relationship consistently we will use the biprobit command.
* This command effectively estimates two probit regressions jointly, allowing the unobserved errors to be correlated; the correlation is parameterized by /athrho (the inverse hyperbolic tangent of rho).
* In this case, the unobserved component of the w estimation equation is the endogenous component of the y estimation equation.
* Since w enters y only linearly, the portion of w's unobservable that is correlated with y is, in a sense, controlled for through the joint probit regression.
biprobit (y = w x1) (w=z x1)
test w= `w_est'

* We can test the "endogeneity" of w by testing the significance of /athrho, which in this case appears to be quite significant.
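* For example, the test of rho = 0 that biprobit reports can also be run directly after estimation (assuming the biprobit estimates above are still in memory):

test [athrho]_cons = 0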

* Not bad: we fail to reject the null that our estimate of the coefficient on w equals the true value.

* This works well with 1,000 observations.  With only 100 observations, however, it seemed to be extremely ineffective.
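* One way to check this more formally is a small Monte Carlo experiment.  The following is a sketch: biprobit_sim is a hypothetical helper program reproducing the data-generating process above, and reps() and nobs() can be varied.

cap program drop biprobit_sim
program define biprobit_sim, rclass
    syntax [, nobs(integer 1000)]
    clear
    set obs `nobs'
    gen z = rnormal()
    gen v = rnormal()
    gen x1 = rnormal()
    gen w = rbinomial(1, normal(-.5 + z*1.25 + v))
    gen y = rbinomial(1, normal(1 + w*1.5 + x1 - (2^.5)*v))
    biprobit (y = w x1) (w = z x1)
    return scalar b_w = _b[y:w]
end

simulate b_w = r(b_w), reps(100): biprobit_sim, nobs(100)
* Compare the mean of b_w with the true scaled coefficient 1.5/(3^.5)
sum b_w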

* What do we do if there are multiple endogenous binary variables?

* Let's generate similar data:
clear
set obs 1000

gen z1 = rnormal()
gen z2 = rnormal()

gen v1 = rnormal()
gen v2 = rnormal()

gen x1 = rnormal()

gen wp1 = normal(-.5 + z1*.5 - z2*.2 + .5^.5*v1)
gen w1 = rbinomial(1, wp1)
label var w1 "Endogenous variable 1"

gen wp2 = normal(.75 - z1*.5 + z2*.2 + .5^.5*v2)
gen w2 = rbinomial(1, wp2)
label var w2 "Endogenous variable 2"

gen yp = normal(.1 + w1*.7 + w2*1 + .5*x1 - .5^.5*v1 + .5^.5*v2)
gen y = rbinomial(1, yp)

* Once again we must adjust our expectations of the coefficients
local var_unob = 1 + (.5^.5)^2 + (.5^.5)^2
di "Variance of unobservables = " `var_unob'
local est_b0 = .1/(`var_unob')^.5
di "Constant coefficient = " `est_b0'
local est_w1 = .7/(`var_unob')^.5
di "w1 coefficient = " `est_w1'
local est_w2 = 1/(`var_unob')^.5
di "w2 coefficient = " `est_w2'
local est_x1 = .5/(`var_unob')^.5
di "x1 coefficient = " `est_x1'

probit y w1 w2 x1
test [y]w1=`est_w1'
test [y]w2=`est_w2'
test [y]_cons=`est_b0'
test [y]x1=`est_x1'

* The majority of estimates are not working well.

* Let's try doing the joint MLE

* Previously the biprobit was sufficient.  However, biprobit only allows for a bivariate probit.

* Fortunately, there is a user-written command that uses simulation to approximate a multivariate probit.

* install mvprobit if not yet installed.
* ssc install mvprobit

* The syntax of mvprobit is very similar to that of biprobit

mvprobit (y = w1 w2 x1) (w1=z1 z2 x1) (w2=z1 z2 x1)
test [y]w1=`est_w1'
test [y]w2=`est_w2'
test [y]_cons=`est_b0'
test [y]x1=`est_x1'

* Unfortunately the estimates are still too far from the true values.

* However, they are closer than the unadjusted probit estimates.

* Let's see how the fitted probabilities compare with the true.
predict yp_hat
* The default prediction is the linear index from the first equation, so convert it to a probability.
replace yp_hat = normal(yp_hat)

reg yp_hat yp, nocon
predict yp_hat1_hat

two (scatter yp yp_hat) (line yp yp_hat1_hat if yp_hat1_hat<1), yscale(range(0 1)) xlabel(0 1) legend(off) title(Predicted probability against true)

* Overall, not a bad fit.

* This might not be sufficient for many applications.

* What happens when one of the endogenous variables is continuous?
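* A natural starting point in that case is Stata's built-in ivprobit command, which handles a continuous endogenous regressor with a binary outcome.  A sketch with arbitrary illustrative coefficients:

clear
set obs 1000
gen z = rnormal()
gen v = rnormal()
gen x1 = rnormal()
* w is continuous and endogenous because v also enters the y equation
gen w = .5 + z + v
gen y = rbinomial(1, normal(.2 + .8*w + x1 - v))
ivprobit y x1 (w = z)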