Saturday, February 16, 2013

Regression with Endogenous Explanatory Variables


* Imagine you would like to estimate the agricultural production process.

* You have two explanatory variables: rainfall and whether the farmer uses hybrid or traditional seeds.

* You are concerned that better-off farmers (in terms of socioeconomic status, SES) will be more likely to use hybrid seeds.

* So should you include hybrid as an explanatory variable?

* After all, it explains much of the variation in output.

* The decision boils down to a single question.

* Is the use of hybrid seeds correlated with rainfall?

* This only makes sense, of course, if farmers choose hybrid seeds with some knowledge of seasonal rainfall in their area: either they buy hybrid seeds based on that knowledge, or they decide whether and when to buy based on what they have observed so far this season (and rainfall is correlated over time).


* Our model is y = rainfall*br + hybrid*bh + u0


* First let's imagine the case when hybrid seed choice and rainfall are correlated.

* In that case we can no longer really call even rainfall exogenous.

* That is, rainfall is only exogenous conditional upon controlling for seed choice.

* i.e. cov(rainfall, u0) = 0 when hybrid is in the model, but cov(rainfall, hybrid) ~= 0  (~= means "not equal").

* Why? Because excluding hybrid from our model gives us:

* y = rainfall*br + u1    ... where u1 = hybrid*bh + u0

* So cov(rainfall, u1) = cov(rainfall, u0) + bh*cov(rainfall, hybrid) = bh*cov(rainfall, hybrid) ~= 0

* This implies that we should include hybrid.

* This is true whenever we have two correlated explanatory variables.  Leaving one out causes the remaining variable to absorb part of the omitted variable's effect.

* The bias can be derived as follows, where x1 and x2 are explanatory variables that are both exogenous when both are included, with x2 correlated with x1.

* The true model: y = x1B1 + x2B2 + u

* Call the linear projection of x2 onto x1: L(x2|x1) = A*x1

* Now estimating just y = x1B1 gives us y = x1B1 + L(x2|x1)*B2 + uhat  ... substituting the linear projection of x2 on x1 for x2.

* Which reduces to y = x1B1 + A*x1*B2 + uhat = (B1 + A*B2)*x1 + uhat

* The bias is therefore A*B2

* Let's see this in action

*****************
* Two "exogenous" variables

clear
set obs 10000

gen x1 = rnormal()

gen x2 = rnormal() + .5*x1    // A = .5
  * Formulating x2 this way is not important for the above result.
  * Any random draw of x1 and x2 that has them covarying together will produce the same results.
  * Only this way it is easy to see that A=.5

gen u = rnormal()*5

gen y = x1*2 + x2*(-3) + u    // B1 = 2,  B2 = -3,  A = .5

* No problem here
reg y x1 x2

* But we will try to not include x2

reg y x1

* Expected coefficient estimate = B1 + A*B2 =  2 + .5*(-3) = .5

* So pretty close.  In this case the bias is equal to A*B2 = -1.5
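
* A rough check of the formula on the data just simulated (a sketch; the scalar
* names A_hat, B1_hat, B2_hat, B1_implied and B1_short are made up for this illustration).
* Estimate A by regressing x2 on x1, take B1 and B2 from the long regression,
* and compare B1 + A*B2 with the coefficient from the short regression.

quietly reg x2 x1
scalar A_hat = _b[x1]

quietly reg y x1 x2
scalar B1_hat = _b[x1]
scalar B2_hat = _b[x2]
scalar B1_implied = B1_hat + A_hat*B2_hat

quietly reg y x1
scalar B1_short = _b[x1]

display "B1 + A*B2 built from the pieces: " B1_implied
display "Coefficient on x1 in reg y x1:   " B1_short

* The two numbers should agree up to rounding, because within a sample the
* omitted variable formula holds as an algebraic identity for OLS.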

**********
* One "exogenous" and one "endogenous" variable

* However, we are concerned that hybrid (x2) is correlated with the error.

* Should we include it?

* The answer is a little subtle.

* If we include it then we will introduce bias

clear
set obs 10000

gen rain = rnormal()
  label var rain "Normalized rain data"

gen ses = rnormal()
  label var ses "Unobserved Socio Economic Status"

gen hybrid = rbinomial(1, normal(rain + ses))
  label var hybrid "Whether a person uses hybrid seeds"

gen u = rnormal()*5

gen y = rain + 3*hybrid + ses + u

* No problem here
reg y rain hybrid ses

* However, we do not have data on ses so:
reg y rain hybrid

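* Suddenly big problems: hybrid now picks up the effect of the omitted ses, and because rain is correlated with hybrid the coefficient on rain is pulled away from its true value as well.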

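* But we can still get an unbiased estimate of the total effect of rainfall on production (its direct effect plus the part that works through hybrid adoption) in the following manner.  Note that this is not br itself, since in this setup more rain also makes hybrid use more likely.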
reg y rain
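
* To check that the pattern above is bias rather than one unlucky draw, here is a
* small Monte Carlo sketch.  The program name endogsim, the 1000 observations,
* the 200 replications and the seed are all arbitrary choices for illustration.
* Note that simulate replaces the data in memory with the simulation results.

capture program drop endogsim
program define endogsim, rclass
    drop _all
    set obs 1000
    gen rain   = rnormal()
    gen ses    = rnormal()
    gen hybrid = rbinomial(1, normal(rain + ses))
    gen y      = rain + 3*hybrid + ses + rnormal()*5
    * ses is unobserved, so it is left out of both regressions
    quietly reg y rain hybrid
    return scalar b_rain   = _b[rain]
    return scalar b_hybrid = _b[hybrid]
    quietly reg y rain
    return scalar b_total  = _b[rain]
end

simulate b_rain=r(b_rain) b_hybrid=r(b_hybrid) b_total=r(b_total), reps(200) seed(12345): endogsim

summarize b_rain b_hybrid b_total

* I would expect the mean of b_rain to sit below the true value of 1 and the
* mean of b_hybrid above the true value of 3.  The mean of b_total should sit
* above 1, since it also captures the effect of rain that runs through hybrid adoption.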

*****************
* Rainfall uncorrelated with hybrid and hybrid endogenous

* However, let's see what happens if hybrid choice is endogenous but uncorrelated with rainfall.

clear
set obs 10000

gen rain = rnormal()
  label var rain "Normalized rain data"

gen ses = rnormal()
  label var ses "Unobserved Socio Economic Status"

gen hybrid = rbinomial(1, normal(ses))
  label var hybrid "Whether a person uses hybrid seeds"

gen u = rnormal()*5

gen y = rain + 3*hybrid + ses + u

reg y rain hybrid
  * The estimate on rainfall is good

reg y rain
  * The estimate on rainfall is less precise (larger standard error).  Why?

  * Because hybrid is now left in the error term, so less of the variation in y is explained.
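
* To see the loss of precision directly, the two regressions can be put side by
* side (a sketch; the stored names with_hybrid and without_hybrid are made up here).

quietly reg y rain hybrid
estimates store with_hybrid

quietly reg y rain
estimates store without_hybrid

estimates table with_hybrid without_hybrid, keep(rain) b(%9.3f) se(%9.3f)

* The standard error on rain should be somewhat larger in the second column.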



*****************
* Therefore I suggest taking the following steps:

* 1. Think through if any of your variables are endogenous.
* 2. If they are endogenous, estimate their correlation with the exogenous explanatory variables.
* 3. If there is no significant correlation then include them in the regression.  This will give you more precise estimates of the coefficients on the exogenous variables.  Do not interpret the coefficients on the endogenous variables.  Also, do not interpret the overall F-test of the model, since you know that your endogenous variables are picking up part of the unobserved error.  If you want to test model fit, instead test the joint significance of all of your exogenous variables with a Wald test.
* 4. If there is a correlation, then it is not clear what to do.  If you exclude endogenous variables that are correlated with the exogenous variables, you will introduce endogeneity into the "exogenous" variables.
* 5. However, if you include the endogenous variable as a control in the equation then you will introduce some bias and inconsistency.  I don't think there are any clear rules about what to do in these circumstances.
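
* Steps 2 and 3 might look like this on the last simulated data set (a sketch;
* it assumes rain, hybrid and y from the final simulation are still in memory
* and that rain is the only exogenous variable).

* Step 2: is the endogenous variable correlated with the exogenous one?
pwcorr hybrid rain, sig

* Step 3: if not, include hybrid as a control but test only the exogenous variables.
reg y rain hybrid
test rain

* With several exogenous variables, list all of them after test to get the joint Wald test.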

1 comment:

  1. Isn't the issue really the potential for hybrid seed use to be correlated with other SES-driven factors, such as access to better equipment, that also impact growth but currently reside in the residual?

    This violates the OLS residual orthogonality assumption, and will lead to biased estimates of the impact of rainfall (br) and hybrid seed use (bh), per the underlying covariance algebra.
