## Monday, May 28, 2012

### Asymmetric Error with Right and Left Censoring

* Dependent variable is censored:

* In a previous post I looked at what happens when we use the tobit maximum likelihood estimator even when the error term is not normally distributed.

* In general, despite the failure of the normality assumption, the tobit proved to be a good estimator in a wide variety of situations with different error structures.

* However, all of those distributions of errors were symmetric distributions.

* There is no reason to believe that in general the unobserved heterogeneity should be symmetric around the expected value.

* Let's see what happens when we relax the symmetry assumption.
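
* A minimal sketch, separate from the simulation below, of how asymmetric the beta errors used later are.
* It draws from the same four rbeta(2,b) distributions and reports the sample skewness of each.
* Skewness is unchanged by the standardization applied later, so the raw draws are enough.
* The 3000 observations simply mirror the simulation; the variable names are only for this illustration.
clear
set obs 3000
foreach b in 2 1 6 9 {
    gen double e`b' = rbeta(2,`b')
    qui sum e`b', detail
    di "rbeta(2,`b'): skewness = " %6.3f r(skewness)
}
* rbeta(2,2) is symmetric, rbeta(2,1) is skewed to the left, and rbeta(2,6) and rbeta(2,9) are skewed to the right.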

cap program drop tobit_monte_carlo
program define tobit_monte_carlo, rclass

* Let's first set up the simulation
clear

* Set the number of observations
set obs 3000

* Let's imagine that we are trying to infer the damages caused by various things to homes in coastal cities.

* Generate some explanatory variables
gen weather = rpoisson(1)
label var weather "The home was hit by extreme weather."

gen crime = rbeta(2,6)
label var crime "Property crime rate in home's area"

gen occupants = rpoisson(4)
label var occupants "The number of people occupying the home"

gen age = (runiform()*40)+18
label var age "The age of the owner"

gen age2=age^2

gen credit = (runiform()*600)+200
label var credit "The credit worthiness of the owner"

* Now let's imagine that there is a lot of low-level unexplained damage

* This loops over the second shape parameter of the beta distribution: 2, 1, 6, 9.
foreach i in 2 1 6 9 {
* This generates an error distribution, standardized below to mean 0 and standard deviation 1
gen e`i' = rbeta(2,`i')
sum e`i'
replace e`i'=(e`i'-r(mean))/r(sd)
* Note: i is never 10 for the values looped over here, so this sign flip never runs.
if `i'==10    replace e`i'=e`i'*(-1)

* The name option saves the graph to memory with the name e`i'
* The first argument (`1') switches graphing on; `0' would hold the whole argument string.
if "`1'"=="graph" qui hist e`i', title(e~rbeta(2,`i')) name(e`i', replace)
* This creates a local list of all of the graphs in memory by adding on to the list every time this loops.
local graphnames `graphnames' e`i'

}
* Combine the four graphs into a single figure
if "`1'"=="graph" graph combine `graphnames'

foreach i in 2 1 6 9 {
* First let's generate the true thing we would like to understand. True amount of home damage.
gen home_damage`i' = -10000 + 100000*weather + 10000*crime + 5000*occupants - 500*age + 20*age2 + 100*credit + e`i'*100000
}
sum home_damage*
* We can reasonably interpret negative values of home_damage as repairs made to the home.

* However, we only have information on insurance payments. Each home has a deductible (here the same for all homes):
gen deductible = 5000

* Each home also has a maximum that the insurance policy will cover:
gen maximum = `2'

* Let us first impose our maximums and minimums
foreach i in 2 1 6 9 {
gen insurance_claims`i' = min(home_damage`i', maximum)
* This puts a cap on payouts but it is a little trickier figuring out minimums

* We know that if the claim is less than the deductible then it is not recorded.
replace insurance_claims`i' = 0 if insurance_claims`i' < deductible
}
sum insurance_claims*
* We can see the different distributions of errors slightly affect payouts but not by much.

*****************************************************************
*** simulation end

* So we want to know, how much did the different factors affect home damages?
* We can observe the insurance claims, the deductibles, and the maximum payout but not any damages that are less or more than that.

* remember home_damage`i' = -10000 + 100000*weather + 10000*crime + 5000*occupants - 500*age + 20*age2 + 100*credit + e`i'*100000

* create a return list for the simulation command
gl return_list

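* Let's see how well the OLS estimator does at recovering the coefficients
* Note: as written, this regression uses the uncensored home_damage`i' rather than insurance_claims`i'.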
foreach i in 2 1 6 9 {
reg home_damage`i' weather crime occupants age age2 credit
foreach v in weather crime occupants age age2 credit {
return scalar OLS_`v'`i' = _b[`v']
gl return_list $return_list OLS_`v'`i'=r(OLS_`v'`i')

}
}

* Let's see how well the tobit estimator does at recovering the coefficients
foreach i in 2 1 6 9 {
tobit home_damage`i' weather crime occupants age age2 credit, ll(5000) ul(`2')
foreach v in weather crime occupants age age2 credit {
return scalar Tob_`v'`i' = _b[`v']
gl return_list $return_list Tob_`v'`i'=r(Tob_`v'`i')
}
}
* End program
end

tobit_monte_carlo graph 100000
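
* A quick way to check what a single run returns, assuming the call above has just finished.
* The program is rclass, so its scalars (named OLS_<variable><i> and Tob_<variable><i>) sit in r().
return list
di "OLS weather coefficient with the rbeta(2,2) errors: " r(OLS_weather2)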

di "simulate $return_list , reps(50): tobit_monte_carlo nograph 100000"

simulate $return_list , reps(50): tobit_monte_carlo nograph 100000

order *weather* *crime* *occup* *age? *age2? *credit*

sum
* It seems that the OLS estimator generally outperforms the tobit estimator.
* There is no reason that this should be the case except that the data suffers from both top and bottom coding.
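
* To make "outperforms" concrete, the sketch below compares each estimator's average estimate with the
* true coefficient from the data generating process. The program name report_bias is a hypothetical
* helper, not part of the original code. It assumes the simulate results are in memory and skips
* any variable that is missing.
cap program drop report_bias
program define report_bias
    * True coefficients taken from the home_damage equation above
    local b_weather   100000
    local b_crime     10000
    local b_occupants 5000
    local b_age       -500
    local b_age2      20
    local b_credit    100
    foreach i in 2 1 6 9 {
        foreach v in weather crime occupants age age2 credit {
            foreach est in OLS Tob {
                capture confirm variable `est'_`v'`i'
                if _rc {
                    continue
                }
                qui sum `est'_`v'`i'
                * Mean of the estimates minus the truth, plus their spread across repetitions
                di "`est' `v' (e`i'): mean bias = " %12.3f (r(mean) - (`b_`v'')) "  sd = " %12.3f r(sd)
            }
        }
    }
end
report_bias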

* I suspect that if there were only bottom coding then the tobit estimator would outperform the OLS estimator.

* In order to test this we can try:
simulate $return_list , reps(50): tobit_monte_carlo nograph 10000000

order *weather* *crime* *occup* *age? *age2? *credit*

sum
* It seems that the OLS estimator still generally outperforms the tobit estimator,
* though it is hard to say. Perhaps this is due to the small sample size.
* A larger sample should reduce the sampling noise in the QMLE estimates.
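
* A more direct check of the bottom-coding conjecture is to censor only from below and call tobit with
* ll() alone. The sketch below is a simplified stand-in for the home damage setup, not the original
* design: the single regressor x, the coefficients 1 and 2, and the censoring point 1.5 are all made up.
clear
set obs 3000
gen x = runiform()
* Right-skewed error, standardized to mean 0 and standard deviation 1 as in the simulation above
gen e = rbeta(2,6)
qui sum e
replace e = (e-r(mean))/r(sd)
gen y_true = 1 + 2*x + e
* Bottom coding only: values below 1.5 are recorded at the limit
gen y_obs = max(y_true, 1.5)
* OLS on the censored outcome, then tobit with a lower limit only
reg y_obs x
tobit y_obs x, ll(1.5)
* Compare both slope estimates with the true value of 2.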