## Monday, May 21, 2012

### Dependent variable is bottom coded

* This simulation follows James Tobin's 1958 paper,
* "Estimation of Relationships for Limited Dependent Variables".

* You want to estimate the Bs from E(Y)=XB, but you can only observe W, where W=W_min if Y<W_min, W=W_max if Y>W_max, and W=Y otherwise.

* This kind of data censoring is very common.  Many quantities are only observable within a range: they are never recorded as less than zero or greater than some value.

* For example, if you are trying to estimate the number of children a man is likely to have, there is no defined upper limit, but there is a well-defined lower limit of zero.
set seed 101

cap program drop gen_data
* Define a program to generate the data
program gen_data

* Let's first set up the simulation
clear

version 11

* Set the number of observations
set obs 10000

* (The random seed is set once, outside this program, so that each call draws fresh data.)

* Generate some explanatory variables
gen man_num_sibs = rpoisson(3)
label var man_num_sibs "The number of siblings that the man has"

gen woman_num_sibs = rpoisson(3)
label var woman_num_sibs "The number of siblings that the spouse has"

gen income = abs(rnormal())*2
label var income "Family income, 10k/year"

* Generate a random normal error
gen e = rnormal()*2

* Generate the number of children each man has
gen Y = .8*man_num_sibs + .6*woman_num_sibs - 2*income + e
label var Y "The true underlying number of children some men would have"

* Restrict the number of children to be nonnegative (censor Y from below at zero).
gen Nchildren = max(Y,0)

* Execute whatever command is specified with this program
`0'
* End the data generation
end

* Run the program once to generate the data
gen_data

reg Nchildren man_num_sibs woman_num_sibs income
* We can see that all of the coefficients are biased towards zero.

* This makes sense: if you restrict the range of the ys,
* then the magnitudes of the coefficients are restricted as well.
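
* A quick way to see how much censoring is at work (a small sketch using only
* the variables created by gen_data above): count the limit observations and
* compare the latent Y with the observed Nchildren.
count if Nchildren == 0
summarize Y Nchildren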

* However, we want to know the uncensored effect of income on the number of children.

* Given the structure of the data, we might restrict our sample, say, to couples in which both partners have 3 or more siblings.

gen restrict = 1 if man_num_sibs>=3 & woman_num_sibs>=3

reg Nchildren man_num_sibs woman_num_sibs income if restrict==1

* We can see that the estimates are now less biased, if noisier.  Let's try a further restriction:
gen restrict2 = 1 if man_num_sibs>3 & woman_num_sibs>3

reg Nchildren man_num_sibs woman_num_sibs income if restrict2==1
* The estimate still looks pretty bad.

* Let's try a further restriction:
gen restrict3 = 1 if man_num_sibs>4 & woman_num_sibs>4

reg Nchildren man_num_sibs woman_num_sibs income if restrict3==1
* The estimate is looking better.

reg Nchildren man_num_sibs woman_num_sibs income if man_num_sibs>5 & woman_num_sibs>5 & income<1
* The estimate actually looks pretty good.  This does not necessarily mean that the estimator is unbiased,
* just that this particular draw of the estimator happens to be close to the true values.

simulate, rep(50): gen_data reg Nchildren man_num_sibs woman_num_sibs income if man_num_sibs>5 & woman_num_sibs>5 & income<1
sum

* We can see that the estimates look pretty close to unbiased.
* However, because the restricted sample is so small, the estimates are so noisy that any single point estimate is unlikely to be large enough, relative to its standard error, to reject the null.
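
* To see how small the restricted sample is, one could regenerate the data
* (simulate leaves its results in memory in place of the data) and count the
* observations that satisfy the restriction. This is only a sketch for one draw.
gen_data
count if man_num_sibs>5 & woman_num_sibs>5 & income<1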

* The more concerning thing with a restriction such as this is that it is no longer clear what is being measured.
* I.e., what is the effect of income on children in a family where both parents have more than 5 siblings and where the income is less than 10,000 a year?
* Is this subsample really comparable to the full population?

* A final restriction that might seem worth investigating is:
gen_data reg Nchildren man_num_sibs woman_num_sibs income if Nchildren>0

* This really creates a sort of selection bias.  It does not help the estimates.

* In effect, all of these estimates come from: W = XB + e - u, where u = Y - W is the sample censorship error

* Because we are simulating data we can actually observe u (because we know Y).
gen u = Y - Nchildren

reg u man_num_sibs woman_num_sibs income
* We can see that all of the explanatory variables are strong predictors of the censorship error.

* What is a little tricky is why the following regression, in which u is zero for every observation,
reg u man_num_sibs woman_num_sibs income if Nchildren>0
* does not mean that the restricted regression (if Nchildren>0) is unbiased.

* In the restricted regression the bias comes from a different source, which we can observe as well.

* It is due to there now being a correlation between the original errors (e) and the explanatory variables:

* We know that initially corr(X,e)=0
reg e man_num_sibs woman_num_sibs income

* However, what happens when we restrict the sample?
reg e man_num_sibs woman_num_sibs income if Nchildren>0

* Strange? No, not really.

* Just as censoring the data produces a censorship error that depends on the X variables, restricting the sample on an outcome that depends on the explanatory variables restricts the unobserved error in a systematic fashion that is itself a function of the explanatory variables.
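
* One way to see this directly (a small sketch; the income cutoff of 2 is an
* arbitrary choice for illustration): compare the mean of e for high-income
* couples before and after the restriction. High-income couples need a large
* positive e to end up with Nchildren>0, so the restricted mean should sit above zero.
summarize e if income>2
summarize e if income>2 & Nchildren>0
summarize e if income<=2 & Nchildren>0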

label var e "e (No restrictions)"

gen e2=e if Nchildren>0
label var e2 "e (# of children>0)"

two (line e income, sort ) (line e2 income, sort col(yellow)),  plotregion(fcolor(gs10)) ///
title(Data restrictions can lead to unfortunate correlations)

* Fortunately, James Tobin made some thoughtful adjustments to OLS for this case,
* assuming that the error is normally distributed:

* Remember E(Y) = XB

* He recognized that the probability of observing W=Nchildren=W_min is:
* P(W=W_min) = P(XB+e < W_min) = P(e < W_min-XB) = P(e > XB-W_min) = 1-Unit_Normal_CDF{(XB-W_min)/sd(e)}

* In other words, the likelihood of observing an outcome equal to the
* minimum value (W_min) is equal to the probability that the random
* error e will push the Y value below W_min.

* Likewise, the probability of observing an outcome W above some value x,
* where x is at or above W_min, is:

* P(W>x) = P(XB+e > x) = P(e > x-XB) = Unit_Normal_CDF{(XB-x)/sd(e)}

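
* The two pieces above can be combined into a log likelihood and maximized by
* hand. What follows is a minimal sketch, not Tobin's own procedure or Stata's
* internal code: the program name tobit_ll_sketch and the equation names xb and
* lnsigma are made up for illustration, and W_min is assumed to be 0.
cap program drop tobit_ll_sketch
program tobit_ll_sketch
    version 11
    args lnf xb lnsigma
    tempvar sigma
    * sd(e) is parameterized as exp(lnsigma) so that it stays positive
    quietly gen double `sigma' = exp(`lnsigma')
    * Limit observations: log of P(e < W_min - XB) with W_min = 0
    quietly replace `lnf' = ln(normal(-`xb'/`sigma')) if $ML_y1 <= 0
    * Non-limit observations: log of the normal density of W around XB
    quietly replace `lnf' = ln(normalden($ML_y1, `xb', `sigma')) if $ML_y1 > 0
end

ml model lf tobit_ll_sketch (xb: Nchildren = man_num_sibs woman_num_sibs income) (lnsigma:)
ml maximize

* The implied estimate of sd(e):
display exp([lnsigma]_b[_cons])
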
* Stata has a built-in command for this called tobit.
* You just need to set the lower limit:

tobit Nchildren man_num_sibs woman_num_sibs income, ll(0)

simulate, reps(50) : gen_data tobit Nchildren man_num_sibs woman_num_sibs income, ll(0)
sum
* These look like pretty unbiased estimates.

gen_data

tobit Nchildren man_num_sibs woman_num_sibs income, ll(0)

* Now what happens when we do not allow the dependent variable to be continuous?
replace Nchildren=round(Nchildren)

tobit Nchildren man_num_sibs woman_num_sibs income, ll(0)

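* Note that gen_data regenerates unrounded data, so the rounding has to be repeated inside each replication.
cap program drop gen_round
program gen_round
    gen_data
    replace Nchildren=round(Nchildren)
    `0'
end

simulate, reps(50) : gen_round tobit Nchildren man_num_sibs woman_num_sibs income, ll(0)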
sum

* We can see that making the number of children whole numbers does not bias the results.
* However, it does cause the estimators to be less precise (larger standard deviations).

* We can explain this easily:
* Y=XB + u
* Y_round = XB + u + rounding_error

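* Since corr(rounding_error,X) is approximately 0, the main price of rounding the dependent variable is a larger error, resulting in less precise estimators.
* This is only an approximation: rounding also moves values between 0 and 0.5 down to the limit of zero.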
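
* One way to check this claim on a single draw (a sketch; rounding_error is a
* variable introduced here for illustration): regenerate the data, compute the
* rounding error, and regress it on the explanatory variables.
gen_data
gen rounding_error = round(Nchildren) - Nchildren
reg rounding_error man_num_sibs woman_num_sibs income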

* Note, a Poisson regression would typically be thought of as appropriate given the count nature of the rounded data.

* This, however, would not be appropriate given the underlying data generating process.
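* (simulate replaced the data in memory, so regenerate and round first.)
gen_data
replace Nchildren=round(Nchildren)
poisson Nchildren man_num_sibs woman_num_sibs income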