Tuesday, October 8, 2013

Finite Sample Properties of IV - Weak Instrument Bias

* There is no proof that an instrumental variables (IV) estimator is unbiased.

* In fact we know that in small enough samples the bias can be large.

* Let's look at a simple setup in which the endogeneity results from an omitted variable (w).

* Our instrument is valid, but the IV estimator is still biased because we are using a "small" sample and the instrument is weak.

clear
set obs 1000

gen z = rnormal()

gen w = rnormal()

gen x = z*.3 + rnormal() + w

gen u = rnormal()

gen y = x + w + u*5

reg y x

ivreg y (x=z)
* The ivreg confidence interval includes the true coefficient (1), though the interval is quite wide.

* This is largely the result of z being a weak instrument for x.
reg x z
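
* A rough gauge of instrument strength is the first-stage F statistic.
* A common rule of thumb (Staiger and Stock) treats an F below about 10
* as a sign of a weak instrument. With a single instrument and no other
* regressors, the overall F stored by the regression above is that
* statistic, so a minimal check is simply to display it.
display "First-stage F = " e(F)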

* The IV estimator is consistent but generally biased in finite samples.

* In order to examine this bias we will run a Monte Carlo
*  simulation to see how biased our estimates are at each sample size.

cap program drop weakreg
program weakreg, rclass
  clear
  set obs `1' 
  * The first argument of the weakreg command is the number of 
  *  observations to draw.
  gen z = rnormal()
  gen w = rnormal()
  gen x = z*.2 + rnormal() + w
  gen u = rnormal()
  gen y = x + w + u*5
  reg y x
    return scalar reg_x = _b[x]
    return scalar reg_se_x = _se[x]
  ivreg y (x=z)
    return scalar ivreg_x = _b[x]
    return scalar iv_se_x = _se[x]
end 

* With only 100 observations
simulate reg_x=r(reg_x)     reg_se_x=r(reg_se_x)  /// 
         ivreg_x=r(ivreg_x) iv_se_x=r(iv_se_x)    /// 
   , rep(10000): weakreg 100
sum
/*
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       reg_x |     10000    1.484123    .3619725   .1481085   2.739473
    reg_se_x |     10000     .357267    .0359624   .2261475   .5157685
     ivreg_x |     10000    14.45544    846.8431  -1318.675   78604.67
     iv_se_x |     10000    208878.2    1.93e+07   .6884335   1.92e+09
*/

* For the IV estimator, the mean standard error estimate is much
* larger than the standard deviation of the estimates.

* In addition, the apparent bias of the IV estimator is huge
* (a mean of about 14.5 against a true coefficient of 1)!
* Thus OLS is the better estimator in this case.
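
* The IV mean above is heavily influenced by a handful of extreme draws
* (note the minimum and maximum). A minimal sketch, assuming the results
* of the 100-observation simulation are still in memory: the detail option
* reports the median and other percentiles, which are less sensitive to
* those draws than the mean.
sum ivreg_x, detail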

simulate reg_x=r(reg_x)     reg_se_x=r(reg_se_x)  /// 
         ivreg_x=r(ivreg_x) iv_se_x=r(iv_se_x)    /// 
   , rep(10000): weakreg 300
sum

/*

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       reg_x |     10000     1.48824    .2065498   .6367978   2.236391
    reg_se_x |     10000     .204807    .0117558   .1625977   .2537656
     ivreg_x |     10000     .883504    16.43395  -489.2355   1346.839
     iv_se_x |     10000     103.928    7065.172   .5418385   696229.8

*/
* Increasing the sample size to 300 vastly improves the IV estimator,
* though it is now downward biased.

simulate reg_x=r(reg_x)     reg_se_x=r(reg_se_x)  /// 
         ivreg_x=r(ivreg_x) iv_se_x=r(iv_se_x)    /// 
   , rep(10000): weakreg 500
sum

/*
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       reg_x |     10000    1.490552     .159073   .9184232   2.088732
    reg_se_x |     10000    .1584161    .0071414   .1326629   .1896385
     ivreg_x |     10000     .841985    4.533252  -337.8417   41.04016
     iv_se_x |     10000    10.02729    672.1545   .5350561   66082.69
*/
* Increasing the sample size to 500 does not seem to reduce the bias
* of the IV estimator, though the average standard error seems to be
* getting closer to the standard deviation of the estimates.

simulate reg_x=r(reg_x)     reg_se_x=r(reg_se_x)  /// 
         ivreg_x=r(ivreg_x) iv_se_x=r(iv_se_x)    /// 
   , rep(10000): weakreg 750
sum

/*
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       reg_x |     10000    1.491437    .1291091   1.012212   1.998231
    reg_se_x |     10000    .1292925    .0047437   .1129745   .1498492
     ivreg_x |     10000    .9714284    1.087322  -12.33198   7.718197
     iv_se_x |     10000    1.080065    .8625042   .4623444   39.95131
*/

* Increasing the sample size to 750 dramatically improves the IV estimator.
* It is still slightly biased but that is not a huge problem.
* The standard errors are now working well too.
* The only remaining problem is that the IV estimator still has such large
* variation that both the OLS estimate and a coefficient of 0 would be
* included in most confidence intervals.
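
* A minimal sketch for checking this claim, assuming the results of the
* 750-observation simulation are still in memory. It uses a
* normal-approximation 95% interval (estimate +/- about 1.96 standard
* errors), and the in_ci_* variable names are made up for illustration.
gen byte in_ci_zero = abs(ivreg_x)         <= invnormal(.975)*iv_se_x
gen byte in_ci_ols  = abs(ivreg_x - reg_x) <= invnormal(.975)*iv_se_x
gen byte in_ci_true = abs(ivreg_x - 1)     <= invnormal(.975)*iv_se_x
* Each mean is the share of replications whose interval contains that value.
sum in_ci_zero in_ci_ols in_ci_true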

simulate reg_x=r(reg_x)     reg_se_x=r(reg_se_x)  /// 
         ivreg_x=r(ivreg_x) iv_se_x=r(iv_se_x)    /// 
   , rep(10000): weakreg 1000
sum

/*
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       reg_x |     10000    1.488341    .1107187   1.064637   1.938462
    reg_se_x |     10000     .111871    .0035312   .1001198   .1255203
     ivreg_x |     10000    .9691499    .8924782  -5.659174    5.94314
     iv_se_x |     10000    .8812981    .2897863     .44236   6.674775
*/

* We can see that the primary gain from more observations is a smaller
* standard error.
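
* The five runs above differ only in the sample size, so they could also
* be written as a single loop. A minimal sketch that reuses weakreg as
* defined above; the loop macro name n is arbitrary.
foreach n in 100 300 500 750 1000 {
  display "Sample size: `n'"
  simulate reg_x=r(reg_x)     reg_se_x=r(reg_se_x)  ///
           ivreg_x=r(ivreg_x) iv_se_x=r(iv_se_x)    ///
     , rep(10000): weakreg `n'
  sum
}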


2 comments:

  1. Hi

    as somebody who regularly consumes cross-country empirical research based on IV regressions with samples of 50-100, I found this quite alarming. But then most of the papers I read will be panel, with T of let's say 50.

    this question may reveal shocking ignorance, but if the number of observations in a panel (N*T) is say 100 * 50, does that translate into a (very) safe sample size?

  2. Why do you use -ivreg- instead of -ivregress-?
