## Wednesday, October 30, 2013

### Using Tobit to Impute Censored Regressors

* Imagine that you have a data set in which one or more of your explanatory variables
* is censored (that is, any x > alpha is reported as alpha).  This type of top-coding is
* typical of income surveys, where a person might otherwise be identifiable because
* too few people within a demographic unit earn above a certain high income level.

* In this simulation I ask: is there a problem with using a Tobit to "correct"
* for the censoring and then using the generated regressor as a predictor?  This simulation
* will of course not give a general answer to that question, but it will at least
* suggest whether things become problematic in generated data.

* First let's imagine that we have a decent set of observable variables x1-x10 which are
* correlated.

clear

set obs 1000

* First let's generate a correlation matrix.
* There must be an easier way but I use two steps to first generate vector C.
* Then expand it into a matrix with all off-diagonal elements equal to .5
matrix C=1,1,1,1,1,1,1,1,1,1
matrix B=C'*C*.5 + I(10)*.5
matrix list B
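
* One easier way, offered as a sketch rather than the author's method: the
* matrix function J(r,c,z) returns an r x c matrix filled with z, so the same
* correlation matrix can be built in one line.  The name B2 is made up here;
* mreldif() reports the relative difference between the two matrices and
* should display 0.
matrix B2 = J(10,10,.5) + I(10)*.5
display mreldif(B, B2)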

* Draw the explanatory variables
drawnorm x1 x2 x3 x4 x5 x6 x7 x8 x9 x10, corr(B)

* Things seem to have gone well
corr x1-x10

* First let's imagine there is a dependent variable:
* y=b0 + x1*b1 + x2*b2 + x3*b3 + x4*b4 + x5*b5 + u

gen y = -5 + x1*1 + x2*(-2) + x3*(1) + x4*(3) + x5*(1) + rnormal()*2

* Note: I am intentionally making the noise unrealistically low
* so that it is easier to identify the bias in the model without
* running repeated simulations.

* Before we do anything let's verify that OLS seems to be working well.
reg y x1-x5

* In general our standard regression seems to be working as expected
* (that is very well).

* Now let's wonder what happens if our explanatory variables x1 and x2
* are censored.

gen x1c = x1
replace x1c = .25 if x1c > .25

hist x1c
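
* As a quick check on how severe the censoring is (a sketch that is not part
* of the original walk-through; the variable name x1top is made up here):
* with x1 drawn as a standard normal, the share of top-coded values should be
* close to 1 - normal(.25), or roughly 40 percent.
gen byte x1top = x1 > .25
sum x1top
display 1 - normal(.25)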




gen x2c = x2
replace x2c = .25 if x2c > .25

reg y x1c x2c x3-x5
* We can see that the coefficients on both x1c and x2c are now biased.  This is
* because the error is now positively correlated with x1c and negatively
* correlated with x2c.

* We can see this by rewriting the model in terms of the censored variables:
* y = -5 + x1c*1 + x2c*(-2) + x3*(1) + x4*(3) + x5*(1) + e

* e = (x1 - x1c)*1 + (x2 - x2c)*(-2) + u

* corr(x1 - x1c, x1c) > 0

gen x1diff = x1-x1c
corr x1diff x1c

* Likewise corr(x2 - x2c, x2c) > 0

gen x2diff = x2-x2c
corr x2diff x2c

* This causes negative bias in the estimate on x2c, since the coefficient on x2 is negative.
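
* The size of this bias can be read off two auxiliary regressions (a minimal
* sketch of the usual omitted-variable logic, not part of the original post):
* regress each omitted term on the included regressors.  The implied shift in
* the coefficient on x1c is roughly 1 times its coefficient in the first
* regression plus (-2) times its coefficient in the second, and similarly
* for x2c.
reg x1diff x1c x2c x3-x5
reg x2diff x1c x2c x3-x5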

* The correction I propose is to impute the topcoded values using the
* variables available.

tobit x1c x2c x3-x10, ul(.25)
predict x1hat

hist x1hat




* Let's check the correlation between x1 and x1hat
corr x1 x1hat

* Not as good as could be hoped.
gen x1imp = x1c
replace x1imp = x1hat if x1c == .25 & x1hat > .25

scatter x1imp x1c
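
* One possible refinement, offered only as a sketch and not something the
* original simulation does: -predict- after -tobit- returns the linear
* prediction xb by default, which ignores the fact that a top-coded value is
* known to exceed .25.  The e(a,b) option of -predict- instead returns the
* expected value conditional on the outcome lying between a and b, with a
* missing value (.) standing for an unbounded limit.  The name x1e is made
* up here.
predict x1e, e(.25, .)
corr x1 x1hat x1e if x1c == .25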




* The same thing with x2c
tobit x2c x1imp x3-x10, ul(.25)
predict x2hat
gen x2imp = x2c
replace x2imp = x2hat if x2c == .25 & x2hat > .25
* Since x2 is censored at .25, we know that only x2hat > .25 can
* be a better match

* Now let's try to estimate our model given our new variables.
reg y x1imp x2imp x3-x5

* Even if this method seemed to work, we would need to
* bootstrap or otherwise adjust the standard errors to account
* for the fact that we are now using generated regressors
* x1imp and x2imp
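
* A minimal sketch of such a bootstrap (an illustration only, not part of the
* original post; the program name impute_reg and the _b-suffixed variable
* names are made up here).  The whole imputation is redone inside each
* resample so that its sampling error feeds into the standard errors.
cap program drop impute_reg
program define impute_reg, rclass
    * Remove leftovers from a previous call so gen and predict do not fail
    cap drop x1hat_b x1imp_b x2hat_b x2imp_b
    tobit x1c x2c x3-x10, ul(.25)
    predict x1hat_b
    gen x1imp_b = x1c
    replace x1imp_b = x1hat_b if x1c == .25 & x1hat_b > .25
    tobit x2c x1imp_b x3-x10, ul(.25)
    predict x2hat_b
    gen x2imp_b = x2c
    replace x2imp_b = x2hat_b if x2c == .25 & x2hat_b > .25
    reg y x1imp_b x2imp_b x3-x5
    return scalar b1 = _b[x1imp_b]
    return scalar b2 = _b[x2imp_b]
end
bootstrap b1=r(b1) b2=r(b2), reps(200): impute_reg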

* We do not have sufficient information at this point to see if
* our imputation method is improving our estimates at all.

* We need to repeat the whole simulation a number of times to find out.

cap program drop tobit_impute
program define tobit_impute, rclass

clear
set obs 1000
drawnorm x1 x2 x3 x4 x5 x6 x7 x8 x9 x10, corr(B)
gen y = -5 + x1*1 + x2*(-2) + x3*(1) + x4*(3) + x5*(1) + rnormal()*2

reg y x1-x5
return scalar x1_ols = _b[x1]
return scalar x2_ols = _b[x2]
return scalar x3_ols = _b[x3]
return scalar x4_ols = _b[x4]
return scalar x5_ols = _b[x5]

gen x1c = x1
replace x1c = .25 if x1c > .25

gen x2c = x2
replace x2c = .25 if x2c > .25

reg y x1c x2c x3-x5
return scalar x1_ols2 = _b[x1c]
return scalar x2_ols2 = _b[x2c]
return scalar x3_ols2 = _b[x3]
return scalar x4_ols2 = _b[x4]
return scalar x5_ols2 = _b[x5]

tobit x1c x2c x3-x10, ul(.25)
predict x1hat

gen x1imp = x1c
replace x1imp = x1hat if x1c == .25

tobit x2c x1imp x3-x10, ul(.25)
predict x2hat
gen x2imp = x2c
replace x2imp = x2hat if x2c == .25

reg y x1imp x2imp x3-x5
return scalar x1_ols3 = _b[x1imp]
return scalar x2_ols3 = _b[x2imp]
return scalar x3_ols3 = _b[x3]
return scalar x4_ols3 = _b[x4]
return scalar x5_ols3 = _b[x5]

end

tobit_impute

simulate  ///
x1_ols=r(x1_ols) x2_ols=r(x2_ols) x3_ols=r(x3_ols)  ///
x4_ols=r(x4_ols) x5_ols=r(x5_ols)  ///
x1_ols2=r(x1_ols2) x2_ols2=r(x2_ols2) x3_ols2=r(x3_ols2)  ///
x4_ols2=r(x4_ols2) x5_ols2=r(x5_ols2)  ///
x1_ols3=r(x1_ols3) x2_ols3=r(x2_ols3) x3_ols3=r(x3_ols3)  ///
x4_ols3=r(x4_ols3) x5_ols3=r(x5_ols3)  ///
, rep(300): tobit_impute

sum

/*

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      x1_ols |       300    .9975067    .0809674   .7533615   1.266807
      x2_ols |       300   -1.998138    .0794069  -2.231982  -1.778604
      x3_ols |       300    .9959236    .0788032   .7804456   1.203429
      x4_ols |       300    3.003273    .0836668   2.726592   3.262676
      x5_ols |       300    .9992972    .0808832   .7657179   1.212335
-------------+--------------------------------------------------------
     x1_ols2 |       300    1.223525    .1201054   .8844188   1.606213
     x2_ols2 |       300   -2.417704    .1229109  -2.758024  -2.092011
     x3_ols2 |       300    .9224241    .0872901   .7420031   1.168221
     x4_ols2 |       300    2.930535    .0910089   2.642824   3.200319
     x5_ols2 |       300    .9313811    .0863095    .667203   1.172618
-------------+--------------------------------------------------------
     x1_ols3 |       300    1.264906    .1325387   .8362417   1.708511
     x2_ols3 |       300   -2.358331    .1368772  -2.704609   -1.99809
     x3_ols3 |       300     1.01038    .0971799   .7765424    1.26452
     x4_ols3 |       300    3.016996    .0963383   2.688083   3.329581
     x5_ols3 |       300    1.019813    .0943988   .7494624   1.267622
*/

* In summary, it seems that our estimate on x2 is slightly improved
* while our estimate on x1 is made worse.  The estimates on the uncensored
* x3-x5 move back toward their true values, but overall the tobit imputation
* fails to correct our censoring problem in any substantive way.
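
* To put numbers on this comparison, the bias and root mean squared error of
* each estimator can be computed from the simulated coefficients (a sketch,
* assuming the simulate results above are still in memory and using the true
* values 1, -2, 1, 3, 1 from the line that generates y; the sq_ variable
* names are made up here).
local truth "1 -2 1 3 1"
forvalues j = 1/5 {
    local b : word `j' of `truth'
    foreach s in ols ols2 ols3 {
        * Mean deviation from the true coefficient
        quietly sum x`j'_`s'
        local bias = r(mean) - (`b')
        * Squared deviation, averaged and square-rooted for the RMSE
        quietly gen sq_x`j'_`s' = (x`j'_`s' - (`b'))^2
        quietly sum sq_x`j'_`s'
        display "x`j' `s': bias = " %7.4f `bias' "  RMSE = " %7.4f sqrt(r(mean))
    }
}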
