* We are often interested in estimating the effect of a binary endogenous regressor on a binary outcome variable.
* It is not obvious how to simulate data that fit the specifications we desire.
* First let's think about the standard IV setup.
* y = b0 + xb1 + wb2 + u
* w = g0 + zg1 + v
* With u and v normally distributed, independent of x and z, but potentially correlated with each other (which is what makes w endogenous).
* We know this setup is generally not correct if either y or w is binary.
* When y is binary we might choose to use an ML probit estimator with the following assumptions:
* P(y=1) = normal(b0 + xb1 + wb2)
* P(w=1) = normal(g0 + zg1)
* However, it is not easy to generate data in this form.
* Instead we will generate data that introduce endogeneity through an unobserved additive error.
clear
set obs 1000
gen z = rnormal()
label var z "Exogenous instrument"
gen v = rnormal()
label var v "Unobserved error"
gen wp = normal(-.5 + z*1.25 + 1*v)
label var wp "Probability that w=1"
gen w = rbinomial(1, wp)
label var w "Endogenous variable"
* We should not expect a probit of w on z to recover the coefficient of 1.25 on z.
* This is because the additive error term v combines with the implicit standard normal error of the probit, creating a composite error with total variance 1 + 1 = 2.
* Thus, when the probit estimator is run, it implicitly rescales the equation so that the error has unit variance.
* We can recover the values the probit should consistently estimate by dividing the true coefficients by the standard deviation of the composite error.
di -.5/(2^.5) " for the constant"
di 1.25/(2^.5) " for the coefficient on z"
gen x1 = rnormal()
label var x1 "Exogenous variable"
probit w z
* These are pretty close to the values we expect; a formal check is added below.
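* As an added check (not in the original write-up), we can test the probit
* estimates against the rescaled true values directly:
local z_est = 1.25/(2^.5)
local c_est = -.5/(2^.5)
test z = `z_est'
test _cons = `c_est'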
gen yp = normal(1 + w*1.5 + x1 - (2^.5)*v)
* Now we are including the v term in the generation of y in order to introduce an endogenous correlation between w and y.
* We need to adjust our estimated coefficients.
di "Variance of the unobserved =" 1^2 + 2^.5^2
di "Standard deviation =" (1^2 + 2^.5^2)^.5
di "Constant coefficient=" 1/(1^2 + 2^.5^2)^.5
di "x coefficient=" 1/(1^2 + 2^.5^2)^.5
local w_est =1.5/(1^2 + 2^.5^2)^.5
di "w coefficient=" `w_est'
gen y = rbinomial(1, yp)
* First let's see what happens if we make no effort to control for the endogeneity.
probit y w x1
test w= `w_est'
* Our estimate of the coefficient on w is far too small.
* In order to estimate our relationship in a consistent manner we will use the biprobit command.
* This command effectively estimates two probit regressions jointly, allowing their unobserved components to be correlated; the correlation is governed by the parameter /athrho.
* In this case, the unobserved component of the w equation is exactly the endogenous component of the y equation.
* Since w enters y only linearly, the part of w's unobserved component that is correlated with y is, in a sense, controlled for through the joint probit regression.
biprobit (y = w x1) (w=z x1)
test w= `w_est'
* Not bad: we fail to reject that our estimate of the coefficient on w equals the true value.
* We can also test the "endogeneity" of w by testing the significance of /athrho, which in this case is clearly significant (one way to run that test explicitly is shown below).
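* A small addition: biprobit stores athrho (the inverse hyperbolic tangent of rho)
* as the constant of an auxiliary equation, so the test can be run directly as:
test [athrho]_cons = 0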
* This works well with 1,000 observations; with 100 observations, however, it seemed to perform very poorly. A rough way to probe that claim is sketched below.
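* The following sketch is added here and is not part of the original analysis
* (the program name biprobit_sim is just illustrative): wrap the data generation
* and biprobit in an rclass program and use -simulate- to look at the distribution
* of the estimated coefficient on w in small samples.
capture program drop biprobit_sim
program define biprobit_sim, rclass
	clear
	set obs 100
	gen z = rnormal()
	gen v = rnormal()
	gen w = rbinomial(1, normal(-.5 + z*1.25 + v))
	gen x1 = rnormal()
	gen y = rbinomial(1, normal(1 + w*1.5 + x1 - (2^.5)*v))
	biprobit (y = w x1) (w = z x1)
	return scalar b_w = _b[y:w]
end
* Uncomment to run (this is slow and replaces the data in memory):
* simulate b_w = r(b_w), reps(100): biprobit_sim
* sum b_w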
* What do we do if there are multiple endogenous binary variables?
* Let's generate similar data:
clear
set obs 1000
gen z1 = rnormal()
gen z2 = rnormal()
gen v1 = rnormal()
gen v2 = rnormal()
gen x1 = rnormal()
gen wp1 = normal(-.5 + z1*.5 - z2*.2 + .5^.5*v1)
gen w1 = rbinomial(1, wp1)
label var w1 "Endogenous variable 1"
gen wp2 = normal(.75 - z1*.5 + z2*.2 + .5^.5*v2)
gen w2 = rbinomial(1, wp2)
label var w2 "Endogenous variable 2"
gen yp = normal(.1 + w1*.7 + w2*1 + .5*x1 - .5^.5*v1 + .5^.5*v2)
gen y = rbinomial(1, yp)
* Once again we must adjust our expectation of the coefficients
local var_unob = 1 + (.5^.5)^2 + (.5^.5)^2
di "Variance of unobservables =" `var_unob'
local est_b0 = .1/(`var_unob')^.5
di "Constant coefficient =" `est_b0'
local est_w1 = .7/(`var_unob')^.5
di "w1 coefficient =" `est_w1'
local est_w2 = 1/(`var_unob')^.5
di "w2 coefficient =" `est_w2'
local est_x1 = .5/(`var_unob')^.5
di "x1 coefficient =" `est_x1'
probit y w1 w2 x1
test [y]w1=`est_w1'
test [y]w2=`est_w2'
test [y]_cons=`est_b0'
test [y]x1=`est_x1'
* The majority of the estimates are not close to their target values.
* Let's try the joint MLE.
* Previously biprobit was sufficient; however, biprobit only allows a bivariate probit.
* Fortunately, there is a user-written command, mvprobit, that uses simulated maximum likelihood to estimate a multivariate probit.
* Install mvprobit if it is not yet installed:
* ssc install mvprobit
* The syntax of mvprobit is very similar to that of biprobit.
mvprobit (y = w1 w2 x1) (w1=z1 z2 x1) (w2=z1 z2 x1)
test [y]w1=`est_w1'
test [y]w2=`est_w2'
test [y]_cons=`est_b0'
test [y]x1=`est_x1'
* Unfortunately, the estimates are still some distance from the true values.
* However, they are closer than the unadjusted probit estimates.
* Let's see how the fitted probability line compares with the true.
* Generate the predicted index for the y equation and convert it to a probability.
predict yp_hat
replace yp_hat = normal(yp_hat)
* Regress the fitted probabilities on the true probabilities (no constant) to get a fitted line.
reg yp_hat yp, nocon
predict yp_hat1_hat
two (scatter yp yp_hat) (line yp yp_hat1_hat if yp_hat1_hat<1), yscale(range(0 1)) xlabel(0 1) legend(off) title(Predicted probability against true)
* Overall, not a bad fit.
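* As a quick added summary of that fit (not in the original write-up), one could
* also look at the correlation between the true and fitted probabilities:
corr yp yp_hat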
* Still, the fit might not be sufficient for many applications.
* What happens when one of the endogenous variables is continuous?