Friday, May 18, 2012

Unobserved effects model - extensions

* I often worry about linearity assumptions and their
* implications.  However, there are some serious strengths
* to the assumptions of linearity.

* For instance, suppose the relationship between an unobservable
* variable z and the observable variable education is non-linear,
* and the relationship between z and earnings is either linear or
* non-linear.  Does including a fixed effect in the model
* still control for this potential source of bias?

* Stata code

* Imagine we have 10,000 individuals that we track
clear
set obs 10000
set seed 1021

gen intelligence = abs(rnormal())
  label var intelligence "Intelligence"

gen int_test = intelligence+rnormal()/3
  * Imagine we have data from some intelligence test available.
  * It is a proxy for intelligence but does not truly measure
  * underlying intelligence.

sort int_test
gen int_tile = _n
  * Ranks the intelligence of subjects from
  * 1 to _N

gen int_per = int_tile/(_N+1)
  * changes the scale of intelligence from 1 to _N to
  * 0 to 1
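  * A quick sanity check (a sketch, not part of the original code):
  * with ranks 1 to _N divided by _N+1, the minimum should be
  * 1/(_N+1) and the maximum _N/(_N+1), so invnormal() below
  * never receives exactly 0 or 1.
  summarize int_per
  assert int_per > 0 & int_per < 1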

* Now let's assume you fit a normal distribution
gen IQ_test = invnormal(int_per)
  label var IQ_test "Normal fitted values for intelligence approximation"
  hist IQ_test, bin(100) kden

gen GPA = round(4*(1-invibeta(2,5, 1-int_per)),.1)
  hist GPA, bin(20) kden
* Imagine that there has been grade inflation

gen id = _n
  label var id "Individual specific ID"

* create 5 observations for each initial observation
expand 5

bysort id: gen year=_n*5
  label var year "Year of observation"

tab year

gen education=0
bysort id: replace education = round(rbinomial(1,.3)+intelligence^.7+education[_n-1],.25) if _n>1
  label var education "Explanatory variable education (with time constant and time varying components)"
* Now education is a non-linear function of intelligence

corr education intelligence
* So there is clearly a strong correlation between intelligence
* and years of education in the data.

gen intel_7= intelligence^.7

corr education intel_7
* The transformed variable intelligence^.7 has an even better fit.

gen u = 10*rnormal()+2*intelligence^2
  label var u "Error term (correlated with unobservables intelligence)"
  * Likewise, the error is a different non-linear function of intelligence

corr u education
  * Thus the error term and education are correlated.

gen earnings = education + u
  label var earnings "Earnings"

reg earnings education
* We can see that OLS is biased because of the correlation
* between the error term and the explanatory variable.

* Note that although the earnings expression above does not specify
* an intercept, one exists because the error term has a non-zero
* expected value, which in turn is because intelligence has a
* non-zero expected value.
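
* A rough check of that claim, under the data generating process
* above: intelligence = |Z| with Z standard normal, so
* E[intelligence^2] = E[Z^2] = 1 and
* E[u] = 10*E[rnormal()] + 2*E[intelligence^2] = 0 + 2 = 2.
* The sample mean of u should therefore be close to 2.
summarize u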

* Let us try including intelligence
reg earnings education intelligence
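
* A hedged aside: u depends on intelligence^2, not on intelligence
* itself, so a linear control need not remove all of the bias.
* As a sketch, controlling for the exact functional form (intel_sq
* is a name introduced here for illustration) leaves only the
* 10*rnormal() noise in the error, so the coefficient on education
* should be close to its true value of 1.
gen intel_sq = intelligence^2
reg earnings education intel_sq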

* Let us try including our intelligence test as a proxy for
* intelligence.
reg earnings education int_test

* We can see that the estimate is still biased, though not as badly
* as when no proxy is included.

* Imagine that instead of having a correctly scaled but noisy
* intelligence test we have an IQ test that imposes its
* own scale on intelligence from a noisy measure.
reg earnings education IQ_test
* This produces a more biased estimate.

* Now imagine you do not even have an intelligence test but
* instead GPA which imposes its own inflated scale as a measure
* of intelligence.
reg earnings education GPA
* This estimate is even more biased.  Ultimately, this is because
* intelligence is more correlated with IQ_test than with GPA.

corr intelligence IQ_test GPA

* Note that even though GPA is a bad measure, it is still better
* than including no proxy for intelligence at all.
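
* To compare the estimates side by side, one option is to store each
* regression and tabulate the education coefficient (true value 1).
* The m_* names are labels chosen here for illustration.
quietly reg earnings education
estimates store m_ols
quietly reg earnings education intelligence
estimates store m_intel
quietly reg earnings education int_test
estimates store m_test
quietly reg earnings education IQ_test
estimates store m_iq
quietly reg earnings education GPA
estimates store m_gpa
estimates table m_ols m_intel m_test m_iq m_gpa, keep(education) b(%9.3f) se(%9.3f)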

* Now, the thing about this data is that you need not worry too
* much about the proxies for intelligence, or even intelligence
* itself, because intelligence is assumed to be a fixed factor in
* your data.  Its effect can therefore be controlled for
* by a simple fixed effects model.

areg earnings education, absorb(id)
* This shows the immense power of having panel data.  No matter
* how non-linear the relationships between the fixed unobserved
* variable, the explanatory variable, and the error term, a simple
* fixed effects model is sufficient to control for this bias, so
* long as intelligence is constant over time and enters earnings
* additively, leaving the return to education unchanged.
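
* A sketch of why this works: the fixed effect is equivalent to
* subtracting each individual's mean from every variable (the
* "within" transformation).  Anything constant within id, such as
* 2*intelligence^2, drops out, whatever its functional form.
* The *_mean and *_dm variable names are introduced here for
* illustration.
bysort id: egen earn_mean = mean(earnings)
bysort id: egen educ_mean = mean(education)
gen earn_dm = earnings - earn_mean
gen educ_dm = education - educ_mean
reg earn_dm educ_dm, noconstant
* The coefficient on educ_dm should match the areg estimate, though
* the standard errors differ because this regression does not
* account for the degrees of freedom used by the individual means.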

* However, if the return to education (the coefficient) is itself
* correlated with intelligence (which is reasonable), then we are in
* a different, much more complicated scenario, and we had best start
* thinking about the world of Correlated Random Coefficients.
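
* A minimal sketch of that scenario, assuming (purely for
* illustration) that the return to education is 1 + 0.5*intelligence.
* earnings_crc and the 0.5 slope are hypothetical choices.
gen earnings_crc = (1 + 0.5*intelligence)*education + u
areg earnings_crc education, absorb(id)
* Compare the estimate with the average return in the sample:
quietly summarize intelligence
display "Average return to education: " 1+0.5*r(mean)
* The fixed effect still removes 2*intelligence^2, but the estimated
* coefficient need not equal this average return.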
