## Monday, September 3, 2012

### Robust Hausman Test

* The Hausman test is commonly used to choose between the fixed effects (FE) and random effects (RE) estimators in a panel data context.  A robust version of the test was first proposed by Arellano (1993) {http://ideas.repec.org/a/eee/econom/v59y1993i1-2p87-97.html}.

* If I understand this properly, the RE estimator is a GLS estimator that is consistent only when each person's individual effect (the term that FE would treat as their fixed effect) is uncorrelated with the explanatory variables.  The individual effect can still affect the outcome; what matters is that it is unrelated to the regressors.

* This exogeneity of individual heterogeneity is often easier to understand in situations where it fails than in situations where it holds.

* Imagine that motivation is relatively constant for individuals.

* Suppose we have multiple years of GPA, which we are trying to predict, along with the number of hours spent studying.  Across individuals, it may be difficult to estimate GPA as a function of hours studied if we ignore the unobserved factor motivation, because motivation may cause individuals both to study more hours and to do better in general regardless of hours spent studying.

* Let's see a simple simulation of this:

clear
* Set a seed so that the simulation is reproducible (the value itself is arbitrary)
set seed 101
set obs 10000

gen id=_n

gen motivation = runiform()
label var motivation "Unobserved student motivation"

expand 3
* We have three years of data per student

* The more motivated students are, the more they study
gen hours_study = runiform()*2+motivation

gen attendance = runiform()

gen u = rnormal()*5*hours_study
* This creates heteroskedasticity: the error variance grows with hours of study.

gen GPA = motivation + hours_study + attendance/2 + u
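
* As a quick check, we can confirm that the simulated motivation is correlated with hours of study, and compare pooled OLS with and without motivation.  Motivation is only observable because this is a simulation; with real data the second regression would not be feasible.
corr motivation hours_study
reg GPA hours_study attendance
reg GPA hours_study attendance motivation
* The true coefficient on hours_study is 1.  Omitting motivation should bias the first estimate upward, though the noise in u is large, so individual runs can vary.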

* Now we have data that we are concerned might not be suitable for RE, but we would like to use RE if we could, since RE is more efficient than FE when its assumptions are met.

xtset id

* Stata has a built-in command to do the traditional Hausman test:
xtreg GPA hours_study attendance, fe
est store fe
xtreg GPA hours_study attendance, re
est store re
hausman fe re
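
* One caveat, as I understand it: the traditional test relies on RE being fully efficient, which the heteroskedastic error u violates, and the difference of the two variance matrices may then fail to be positive definite.  The sigmamore option of hausman bases both variance matrices on the same estimate of the error variance, which helps with the second problem but does not make the test robust:
hausman fe re, sigmamore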

* Alternatively, using the Chamberlain-Mundlak device, we can run a similar test:
foreach v in hours_study attendance {
bysort id: egen mean_`v' = mean(`v')
}

xtreg GPA hours_study mean_hours_study attendance mean_attendance, re
test mean_hours_study mean_attendance
* This test result is not exactly the same.  I think this is because the two tests are asymptotically equivalent but not equivalent in finite samples.

* I think this second form of the test is more informative.  We add the individual-level mean of each explanatory variable and test whether those means have explanatory power beyond that of the variables themselves.
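
* One way to see what the device does (a sketch, relying on the panel being balanced): the coefficients on hours_study and attendance in the augmented RE regression should match the FE coefficients, while the coefficients on the means pick up the difference between the between and within estimates.
xtreg GPA hours_study attendance, fe
xtreg GPA hours_study mean_hours_study attendance mean_attendance, re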

* This was somewhat disarming for me.  I thought, well what about the unexplained variation uncorrelated with the mean explanatory variables?

* Well, since a FE model can only control for time-constant unexplained variation, controlling for that variation through the use of means is surprisingly comprehensive.

* If the individual-level means of the explanatory variables are uncorrelated with the error, then using a fixed effects approach is not going to improve the estimates.

* The additional benefit of this form of the Hausman test is that it is extremely easy to make the test robust to heteroskedasticity and within-individual correlation: just cluster the standard errors by individual.

xtreg GPA hours_study mean_hours_study attendance mean_attendance, re vce(cluster id)
test mean_hours_study mean_attendance

* Since the mean variables are jointly significant, we conclude that there is unobserved heterogeneity correlated with the explanatory variables.  This makes the RE estimator inconsistent, so FE is preferred.

* Note also that the same kind of logic can be applied to a decision between FE and pooled OLS, since RE can be shown to lie between the two: it quasi-demeans the data, which reduces to pooled OLS when no demeaning is applied and to FE when the data are fully demeaned.
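
* A minimal sketch of that idea, assuming the same mean variables as above: estimate the augmented regression by pooled OLS with standard errors clustered by individual, then test the means jointly.
reg GPA hours_study mean_hours_study attendance mean_attendance, vce(cluster id)
test mean_hours_study mean_attendance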