* I often worry about linearity assumptions and their
* implications. However, the assumption of linearity also
* has some serious strengths.
* For instance, suppose the relationship between an unobservable
* variable z and the observable variable education is non-linear,
* and the relationship between z and earnings may be linear or
* non-linear. Does including a fixed effect in the model
* still control for this potential source of bias?
* Stata code
clear
* Imagine we have 10,000 individuals that we track
set obs 10000
set seed 1021
gen intelligence = abs(rnormal())
label var intelligence "Intelligence"
gen int_test = intelligence+rnormal()/3
* Imagine we have data from some intelligence test available.
* It is a proxy for intelligence but does not truly measure
* underlying intelligence.
sort int_test
gen int_tile = _n
* Ranks the intelligence of subjects from
* 1 to _N
gen int_per = int_tile/(_N+1)
* Rescales the rank from the range 1 to _N to a value
* strictly between 0 and 1.
* Now let's assume you fit a normal distribution to these percentiles.
gen IQ_test = invnormal(int_per)
label var IQ_test "Normal fitted values for intelligence approximation"
hist IQ_test, bin(100) kden
gen GPA = round(4*(1-invibeta(2,5, 1-int_per)),.1)
hist GPA, bin(20) kden
* Imagine that there has been grade inflation
gen id = _n
label var id "Individual specific ID"
* create 5 observations for each initial observation
expand 5
bysort id: gen year=_n*5
label var year "Year of observation"
tab year
gen education=0
bysort id: replace education = round(rbinomial(1,.3)+intelligence^.7+education[_n-1],.25) if _n>1
label var education "Explanatory variable education (with time constant and time varying components)"
* Now education is a non-linear function of intelligence.
corr education intelligence
* So there is clearly a strong correlation between intelligence
* and years of education in the data.
gen intel_7= intelligence^.7
corr education intel_7
* intelligence^.7 is even more strongly correlated with education.
gen u = 10*rnormal()+2*intelligence^2
label var u "Error term (correlated with unobservable intelligence)"
* Likewise, the error is a different non-linear function of intelligence.
corr u education
* Thus the error term and education are correlated.
gen earnings = education + u
label var earnings "Earnings"
reg earnings education
* We can see that OLS is biased because of the correlation
* between the error term and the explanatory variable.
* Note that although the earnings expression above did not specify
* an intercept, one exists because the error term has a non-zero
* expected value, which in turn is because intelligence^2 has a
* non-zero expected value.
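* As a quick check of that claim: intelligence = abs(rnormal()), so
* intelligence^2 has expectation 1 and u should average roughly 2.
sum u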
* Let us try including intelligence.
reg earnings education intelligence
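* Because u was built from intelligence^2, a linear control for
* intelligence may leave some bias. As a sketch, adding the squared
* term (via factor-variable notation) matches the way u was simulated
* and should bring the coefficient on education close to 1.
reg earnings education c.intelligence##c.intelligence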
* Let us try including our intelligence test as a proxy for
* intelligence.
reg earnings education int_test
* We can see that the estimate is still biased, though less badly
* than when no proxy is included.
* Imagine that, instead of a correctly scaled but noisy
* intelligence test, we have an IQ test that imposes its
* own scale on intelligence from a noisy measure.
reg earnings education IQ_test
* This produces a more biased estimate.
* Now imagine you do not even have an intelligence test but
* instead GPA which imposes its own inflated scale as a measure
* of intelligence.
reg earnings education GPA
* This estimate is even more biased. Ultimately, this is because
* intelligence is more correlated with IQ_test than with GPA.
corr intelligence IQ_test GPA
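* A short sketch to separate scale from information: IQ_test and GPA
* were both built from the same ranking (int_per), so their rank
* correlation should be close to 1, apart from ties created by
* rounding GPA. The two proxies differ only in the scale they impose.
spearman IQ_test GPA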
* Note that even though GPA is a bad measure, it still appears to be
* better than having no proxy for intelligence at all.
* Now, the thing about this data is that you need not worry too
* much about the proxies for intelligence, or even intelligence itself,
* because intelligence is assumed to be a fixed factor in your data,
* and its effect can therefore be controlled
* by a simple fixed effects model.
areg earnings education, absorb(id)
* This shows the immense power of having panel data. No matter
* what the relationship is between the fixed unobserved variable,
* the explanatory variable, and the error, a simple fixed effects
* model is sufficient to control for this bias, so long as
* intelligence shifts earnings additively and does not change
* the return to education.
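* As a sketch of why this works (the names earn_mean, educ_mean,
* earn_dm and educ_dm are hypothetical, introduced only for
* illustration): the fixed effects estimator is equivalent to
* demeaning each variable within id, which wipes out anything that is
* constant within an individual, including 2*intelligence^2 in u.
bysort id: egen earn_mean = mean(earnings)
bysort id: egen educ_mean = mean(education)
gen earn_dm = earnings - earn_mean
gen educ_dm = education - educ_mean
reg earn_dm educ_dm
* The slope should match the areg coefficient. The standard errors
* differ slightly because this regression does not adjust the degrees
* of freedom for the absorbed individual means.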
* However, if the return to education (the coefficient) is also
* correlated with intelligence (which is reasonable), then we are in a
* different, much more complicated scenario, and we had best start
* thinking about the world of Correlated Random Coefficients.
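* A minimal sketch of such a scenario, assuming (hypothetically) that
* each individual's return to education is b_i = 1 + .5*intelligence.
* The names b_i and earnings_crc and the value .5 are assumptions made
* purely for illustration.
gen b_i = 1 + .5*intelligence
gen earnings_crc = b_i*education + u
* Average return to education in the simulated population
sum b_i
* Fixed effects estimate, which need not equal that average because
* individuals whose education varies more over time get more weight
areg earnings_crc education, absorb(id)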