* Often we are interested in estimating the effect of a binary endogenous regressor on a binary outcome variable.

* It is not obvious how to simulate data that meets the specifications we desire.

* First let's think about the standard IV setup.

* y = b0 + xb1 + wb2 + u

* w = g0 + zg1 + v

* With u and v distributed normally and independently of x, w, and z.

* We know this setup is generally not correct if either y is binary or w is binary.

* When y is binary we might choose to use a MLE probit estimator with the following assumptions:

* P(y=1) = normal(b0 + xb1 + wb2)

* P(w=1) = normal(g0 + zg1)

* However, it is not easy to generate data directly in this form, since the errors of the two equations must be correlated for w to be endogenous.

* Instead we will generate data that introduces endogeneity through an unobserved additive error.

clear

set obs 1000

gen z = rnormal()

label var z "Exogenous instrument"

gen v = rnormal()

label var v "Unobserved error"

gen wp = normal(-.5 + z*1.25 + 1*v)

label var wp "Endogenous variable"

gen w = rbinomial(1, wp)

* We should not expect the probit estimator to recover the coefficient of 1.25 on z from the equation above.

* This is because the additive error term v adds to the implicit error of the probit (which has variance 1), creating an error with total variance 1 + 1 = 2.

* Thus, when the probit estimator is run it automatically rescales the equation so that the error has unit variance.

* We can recover consistent estimates by rescaling the true coefficients by the standard deviation of the total error.

di -.5/(2^.5) " for the constant"

di 1.25/(2^.5) " for the coefficient on z"

gen x1 = rnormal()

label var x1 "Exogenous variable"

probit w z

* Pretty close estimates to what we expect.
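* As a quick sanity check (a sketch, not in the original), we can formally test the probit estimates against the rescaled true values; they should usually not be rejected:

test z = `=1.25/sqrt(2)'

test _cons = `=-.5/sqrt(2)'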

gen yp = normal(1 + w*1.5 + x1 - (2^.5)*v)

* Now we are including the v term in the generation of y in order to introduce an endogenous correlation between w and y.

* We need to adjust our estimated coefficients.

di "Variance of the unobserved = " 1^2 + (2^.5)^2

di "Standard deviation = " (1^2 + (2^.5)^2)^.5

di "Constant coefficient = " 1/(1^2 + (2^.5)^2)^.5

di "x coefficient = " 1/(1^2 + (2^.5)^2)^.5

local w_est = 1.5/(1^2 + (2^.5)^2)^.5

di "w coefficient=" `w_est'

gen y = rbinomial(1, yp)

* First let's see what happens if we neglect to make any effort at controlling for the endogeneity.

probit y w x1

test w= `w_est'

* Our estimate of the coefficient on w is way too small

* In order to estimate our relationship in a consistent manner we will use the biprobit command.

* This command effectively estimates two probit regressions jointly, allowing their unobserved errors to be correlated; the correlation is parameterized by /athrho (the inverse hyperbolic tangent of rho).

* In this case, the unobserved component from the w estimation equation is the endogenous component from the y estimation equation.

* Since w enters y only linearly, it is sufficient that the unobserved portion of w correlated with y is, in a sense, controlled for through the joint probit regression.

biprobit (y = w x1) (w=z x1)

test w= `w_est'

* Not bad: we fail to reject that there is any difference between our estimate of the coefficient on w and the true value.

* We can test the "endogeneity" of w by testing the significance of /athrho, which in this case appears to be quite significant.
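* A sketch of a direct Wald test of rho = 0 (biprobit also reports a comparable test of rho = 0 at the foot of its output):

test [athrho]_cons = 0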

* This works well with 1,000 observations; with 100 observations, however, it seemed to be extremely ineffective.

* What do we do if there are multiple endogenous binary variables?

* Let's generate similar data:

clear

set obs 1000

gen z1 = rnormal()

gen z2 = rnormal()

gen v1 = rnormal()

gen v2 = rnormal()

gen x1 = rnormal()

gen wp1 = normal(-.5 + z1*.5 - z2*.2 + .5^.5*v1)

gen w1 = rbinomial(1, wp1)

label var w1 "Endogenous variable 1"

gen wp2 = normal(.75 - z1*.5 + z2*.2 + .5^.5*v2)

gen w2 = rbinomial(1, wp2)

label var w2 "Endogenous variable 2"

gen yp = normal(.1 + w1*.7 + w2*1 + .5*x1 - .5^.5*v1 + .5^.5*v2)

gen y = rbinomial(1, yp)

* Once again we must adjust our expectation of the coefficients

local var_unob = 1 + (.5^.5)^2 + (.5^.5)^2

di "Variance of unobservables =" `var_unob'

local est_b0 = .1/(`var_unob')^.5

di "Constant coefficient = " `est_b0'

local est_w1 = .7/(`var_unob')^.5

di "w1 coefficient = " `est_w1'

local est_w2 = 1/(`var_unob')^.5

di "w2 coefficient = " `est_w2'

local est_x1 = .5/(`var_unob')^.5

di "x1 coefficient = " `est_x1'

probit y w1 w2 x1

test [y]w1=`est_w1'

test [y]w2=`est_w2'

test [y]_cons=`est_b0'

test [y]x1=`est_x1'

* The majority of estimates are not working well.

* Let's try doing the joint MLE

* Previously the biprobit command was sufficient. However, biprobit only allows for a bivariate probit.

* Fortunately, there is a user-written command, mvprobit, that uses simulated maximum likelihood to approximate a multivariate probit.

* install mvprobit if not yet installed.

* ssc install mvprobit

* The syntax of mvprobit is very similar to that of biprobit

mvprobit (y = w1 w2 x1) (w1=z1 z2 x1) (w2=z1 z2 x1)

test [y]w1=`est_w1'

test [y]w2=`est_w2'

test [y]_cons=`est_b0'

test [y]x1=`est_x1'

* Unfortunately the estimates are still too far from the true values.

* However, they are closer than before.

* Let's see how the fitted probability line compares with the true.

* predict after mvprobit gives the linear prediction (for the first equation), so we convert it to a probability.

predict yp_hat

replace yp_hat = normal(yp_hat)

reg yp_hat yp, nocon

predict yp_hat1_hat

two (scatter yp yp_hat) (line yp yp_hat1_hat if yp_hat1_hat<1, sort), yscale(range(0 1)) xlabel(0 1) legend(off) title(Predicted probability against true)

* Overall, not a bad fit.
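* As an additional rough summary of fit, we can also look at the correlation between the true and fitted probabilities:

corr yp yp_hat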

* This might not be sufficient for many applications.

* What happens when one of the endogenous variables is continuous?