* Imagine that you have a data set in which one or more of your explanatory variables
* is censored (that is, x > alpha is reported as alpha). This type of censoring is
* typical of surveys such as income surveys, where it might otherwise be possible to
* identify a person because within a demographic unit there are too few people who
* make above a certain high income level.
* In this simulation I ask: is there a problem with using a Tobit to "correct"
* for the censoring and then using the generated regressor as a predictor? The
* simulation cannot, of course, give a general answer to that question, but it will
* at least suggest whether things become problematic even in generated data.
* First let's imagine that we have a decent set of observable variables x1-x10 which are
* correlated.
clear
set obs 1000
* First let's generate a correlation matrix.
* There must be an easier way, but I use two steps: first generate the vector C,
* then expand it into a matrix with all non-diagonal elements equal to .5.
matrix C=1,1,1,1,1,1,1,1,1,1
matrix B=C'*C*.5 + I(10)*.5
matrix list B
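For reference, the same equicorrelation construction can be sketched outside Stata. This is a hypothetical NumPy equivalent for illustration only, not part of the do-file:

```python
# Build a 10x10 correlation matrix with ones on the diagonal and .5
# everywhere else, mirroring C'*C*.5 + I(10)*.5 above.
import numpy as np

k = 10
C = np.ones((1, k))
B = C.T @ C * 0.5 + np.eye(k) * 0.5

# Diagonal is 1, off-diagonal is .5, and the matrix is positive
# definite (eigenvalues .5 and 5.5), so it is a valid correlation matrix.
print(np.allclose(np.diag(B), 1.0))
print(B[0, 1])
print(np.linalg.eigvalsh(B).min() > 0)
```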
* Draw the explanatory variables
drawnorm x1 x2 x3 x4 x5 x6 x7 x8 x9 x10, corr(B)
* Things seem to have gone well
corr x1-x10
* Now let's define a dependent variable:
* y=b0 + x1*b1 + x2*b2 + x3*b3 + x4*b4 + x5*b5 + u
gen y = -5 + x1*1 + x2*(-2) + x3*(1) + x4*(3) + x5*(1) + rnormal()*2
* Note: I am intentionally making the noise unrealistically low
* so that it is easier to identify the bias in the model without
* running repeated simulations.
* Before we do anything let's verify that OLS seems to be working well.
reg y x1-x5
* In general our standard regression seems to be working as expected
* (that is very well).
* Now consider what happens if our explanatory variables x1 and x2
* are censored (top-coded) at .25.
gen x1c = x1
replace x1c = .25 if x1c > .25
hist x1c
gen x2c = x2
replace x2c = .25 if x2c > .25
reg y x1c x2c x3-x5
* We can see that the coefficients on both x1c and x2c are now biased. This is
* because the error is now positively correlated with x1c and negatively
* correlated with x2c.
* We can see this by rewriting the model in terms of the censored variables:
* y = -5 + x1c*1 + x2c*(-2) + x3*(1) + x4*(3) + x5*(1) + e
* where e = (x1 - x1c)*1 + (x2 - x2c)*(-2) + u
* and corr(x1 - x1c, x1c) > 0
gen x1diff = x1-x1c
corr x1diff x1c
* Likewise corr(x2 - x2c, x2c) > 0
gen x2diff = x2-x2c
corr x2diff x2c
* This causes a negative bias in the x2c estimator since its coefficient is
* negative, and a corresponding upward bias in the x1c estimator.
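This sign pattern can be checked outside the simulation as well. Below is a minimal Python/NumPy sketch of the same mechanism with a single regressor (illustrative only, not the model above):

```python
# Top-coding a regressor at .25 pushes the omitted piece (x - xc) into
# the error term; that piece is positively correlated with xc, so OLS
# on the censored regressor is biased away from the true slope of 1.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.standard_normal(n)
y = 1.0 * x + rng.standard_normal(n)      # true slope = 1
xc = np.minimum(x, 0.25)                  # censor (top-code) at .25

corr = np.corrcoef(x - xc, xc)[0, 1]      # corr(x - xc, xc)
slope = np.cov(y, xc)[0, 1] / np.var(xc)  # OLS slope of y on xc

print(corr > 0)    # the censored-away part rises with xc
print(slope > 1)   # upward bias, matching the regression above
```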
* The correction I propose is to impute the top-coded values using the
* other variables available.
tobit x1c x2c x3-x10, ul(.25)
predict x1hat
hist x1hat
* Let's check the correlation between x1 and x1hat
corr x1 x1hat
* Not as good as could be hoped.
gen x1imp = x1c
replace x1imp = x1hat if x1c == .25 & x1hat > .25
scatter x1imp x1c
* The same thing with x2c
tobit x2c x1imp x3-x10, ul(.25)
predict x2hat
gen x2imp = x2c
replace x2imp = x2hat if x2c == .25 & x2hat >= .25
* Since the variables are censored at .25, we know that only predicted values
* above .25 can be a better match than the top-coded value.
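One caveat worth flagging: as far as I know, predict after tobit with no option returns the linear index xb, an estimate of the uncensored mean, not of E[x | x > .25]. For a normal variable the conditional mean above a cutoff exceeds the unconditional one by an inverse Mills ratio term, so xb-based imputations will tend to sit below the values they replace. A quick check of that formula (Python for illustration only; mu, sigma, and c are hypothetical stand-ins, not estimates from this model):

```python
# E[x | x > c] for x ~ N(mu, sigma^2) equals
#   mu + sigma * phi(a) / (1 - Phi(a)),  with  a = (c - mu) / sigma,
# which is strictly above mu. Verify by simulation using only the stdlib.
import math
import random

def phi(z):                       # standard normal pdf
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def Phi(z):                       # standard normal cdf
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma, c = 0.0, 1.0, 0.25
a = (c - mu) / sigma
analytic = mu + sigma * phi(a) / (1 - Phi(a))   # inverse Mills ratio term

random.seed(1)
draws = [random.gauss(mu, sigma) for _ in range(500_000)]
kept = [d for d in draws if d > c]
empirical = sum(kept) / len(kept)

print(round(analytic, 2))                  # about 0.96, well above c = 0.25
print(abs(analytic - empirical) < 0.02)
```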
* Now let's try to estimate our model given our new variables.
reg y x1imp x2imp x3-x5
* Even if this method seemed to work, we would need to
* bootstrap or otherwise adjust the standard errors to account
* for the fact that we are now using the generated regressors
* x1imp and x2imp.
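The bootstrap adjustment mentioned above would re-run the whole two-stage procedure (tobit imputation plus final OLS) on each resample. A minimal sketch of that resampling pattern (Python, with a plain OLS slope standing in for the two-stage estimator; all names here are illustrative):

```python
# Pairs bootstrap: resample rows with replacement and recompute the full
# estimator each time; the standard deviation across replications
# estimates the standard error, automatically reflecting any
# generated-regressor steps folded into estimate().
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.standard_normal(n)
y = 2.0 * x + rng.standard_normal(n)      # true slope = 2

def estimate(xs, ys):
    # stand-in for the tobit-impute-then-OLS estimator
    return np.cov(xs, ys)[0, 1] / np.var(xs)

reps = []
for _ in range(500):
    idx = rng.integers(0, n, n)           # resample rows with replacement
    reps.append(estimate(x[idx], y[idx]))

se = np.std(reps, ddof=1)
print(abs(np.mean(reps) - 2.0) < 0.2)     # centered near the true slope
print(0.0 < se < 0.2)                     # a plausible standard error
```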
* We do not have sufficient information at this point to tell whether
* our imputation method is improving our estimates at all.
* We need to repeat the whole simulation a number of times to hope
* to answer that question.
cap program drop tobit_impute
program define tobit_impute, rclass
    clear
    set obs 1000
    * Note: this relies on the correlation matrix B already being in memory
    drawnorm x1 x2 x3 x4 x5 x6 x7 x8 x9 x10, corr(B)
    gen y = -5 + x1*1 + x2*(-2) + x3*(1) + x4*(3) + x5*(1) + rnormal()*2
    * Benchmark: OLS on the uncensored regressors
    reg y x1-x5
    return scalar x1_ols = _b[x1]
    return scalar x2_ols = _b[x2]
    return scalar x3_ols = _b[x3]
    return scalar x4_ols = _b[x4]
    return scalar x5_ols = _b[x5]
    * Top-code x1 and x2 at .25
    gen x1c = x1
    replace x1c = .25 if x1c > .25
    gen x2c = x2
    replace x2c = .25 if x2c > .25
    * OLS on the censored regressors
    reg y x1c x2c x3-x5
    return scalar x1_ols2 = _b[x1c]
    return scalar x2_ols2 = _b[x2c]
    return scalar x3_ols2 = _b[x3]
    return scalar x4_ols2 = _b[x4]
    return scalar x5_ols2 = _b[x5]
    * Tobit imputation of the top-coded values
    tobit x1c x2c x3-x10, ul(.25)
    predict x1hat
    gen x1imp = x1c
    replace x1imp = x1hat if x1c == .25
    tobit x2c x1imp x3-x10, ul(.25)
    predict x2hat
    gen x2imp = x2c
    replace x2imp = x2hat if x2c == .25
    * OLS on the imputed regressors
    reg y x1imp x2imp x3-x5
    return scalar x1_ols3 = _b[x1imp]
    return scalar x2_ols3 = _b[x2imp]
    return scalar x3_ols3 = _b[x3]
    return scalar x4_ols3 = _b[x4]
    return scalar x5_ols3 = _b[x5]
end
* Run the program once to make sure it works
tobit_impute
simulate ///
x1_ols=r(x1_ols) x2_ols=r(x2_ols) x3_ols=r(x3_ols) ///
x4_ols=r(x4_ols) x5_ols=r(x5_ols) ///
x1_ols2=r(x1_ols2) x2_ols2=r(x2_ols2) x3_ols2=r(x3_ols2) ///
x4_ols2=r(x4_ols2) x5_ols2=r(x5_ols2) ///
x1_ols3=r(x1_ols3) x2_ols3=r(x2_ols3) x3_ols3=r(x3_ols3) ///
x4_ols3=r(x4_ols3) x5_ols3=r(x5_ols3) ///
, rep(300): tobit_impute
sum
/*
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
x1_ols | 300 .9975067 .0809674 .7533615 1.266807
x2_ols | 300 -1.998138 .0794069 -2.231982 -1.778604
x3_ols | 300 .9959236 .0788032 .7804456 1.203429
x4_ols | 300 3.003273 .0836668 2.726592 3.262676
x5_ols | 300 .9992972 .0808832 .7657179 1.212335
-------------+--------------------------------------------------------
x1_ols2 | 300 1.223525 .1201054 .8844188 1.606213
x2_ols2 | 300 -2.417704 .1229109 -2.758024 -2.092011
x3_ols2 | 300 .9224241 .0872901 .7420031 1.168221
x4_ols2 | 300 2.930535 .0910089 2.642824 3.200319
x5_ols2 | 300 .9313811 .0863095 .667203 1.172618
-------------+--------------------------------------------------------
x1_ols3 | 300 1.264906 .1325387 .8362417 1.708511
x2_ols3 | 300 -2.358331 .1368772 -2.704609 -1.99809
x3_ols3 | 300 1.01038 .0971799 .7765424 1.26452
x4_ols3 | 300 3.016996 .0963383 2.688083 3.329581
x5_ols3 | 300 1.019813 .0943988 .7494624 1.267622
*/
* In summary, it seems that our estimate on x2 is slightly improved
* while our estimate on x1 is made worse. Overall our estimates are
* not really any better, suggesting that the tobit imputation
* failed to correct our censoring problem in any substantive way.
* Formatted By Econometrics by Simulation