Monday, May 28, 2012

Asymmetric Error with Right and Left Censoring


* Dependent variable is censored from both the right and the left.

* In a previous post I looked at what happens when we use the tobit maximum likelihood estimator even when the error term is not normally distributed.

* In general we saw that, despite the failure of the normality assumption, the tobit remained a good estimator in a wide variety of situations with different error structures.

* However, all of those distributions of errors were symmetric distributions.

* There is no reason to believe that in general the unobserved heterogeneity should be symmetric around the expected value.

* Let's see what happens as we relax this assumption.
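* Before the full simulation, here is a minimal sketch (my addition, separate from the program below) illustrating the asymmetry: a standardized beta draw keeps its skewness while a normal draw is symmetric.
clear
set obs 10000
gen e_norm = rnormal()
gen e_beta = rbeta(2,9)
sum e_beta
replace e_beta = (e_beta-r(mean))/r(sd)
* The detail option reports skewness: close to 0 for the normal, clearly positive for the beta(2,9).
sum e_norm e_beta, detail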

cap program drop tobit_monte_carlo
program define tobit_monte_carlo, rclass

  * Let's first set up the simulation
  clear

  * Set the number of observations
  set obs 3000

  * Let's imagine that we are trying to infer the damages caused by various things to homes in coastal cities.

  * Generate some explanatory variables
  gen weather = rpoisson(1)
    label var weather "Number of extreme weather events that hit the home"

  gen crime = rbeta(2,6)
    label var crime "Property crime rate in home's area"

  gen occupants = rpoisson(4)
    label var occupants "The number of people occupying the home"

  gen age = (runiform()*40)+18
    label var age "The age of the owner"

  gen age2=age^2
    label var age2 "The age of the owner squared"

  gen credit = (runiform()*600)+200
    label var credit "The creditworthiness of the owner"

  * Now let's imagine that there is a lot of low-level unexplained damage.

  * This will loop over the beta shape parameters 2, 1, 6, and 9.
  foreach i in 2 1 6 9 {
   * This generates an error distribution whose skew depends on the second beta parameter.
   gen e`i' = rbeta(2,`i')
   * Standardize the error to mean 0 and standard deviation 1 so the scales are comparable.
   sum e`i'
   replace e`i'=(e`i'-r(mean))/r(sd)
   * Flip the most skewed distribution so that one of the errors is left skewed.
   if `i'==9 replace e`i'=e`i'*(-1)
 
   * The name option saves the graph to memory with the name e`i'
  if "`0'"=="graph" qui hist e`i', title(e~rbeta(2,`i')) name(e`i', replace)
   * This creates a local list of all of the graphs in memory by adding on to the list every time this loops.
local graphnames `graphnames' e`i'

  }
  * Graphs the combined 4 graphs
  if "`0'"=="graph" graph combine `graphnames'

  foreach i in 2 1 6 9 {
  * First let's generate the true quantity we would like to understand: the true amount of home damage.
    gen home_damage`i' = -10000 + 100000*weather + 10000*crime + 5000*occupants - 500*age + 20*age2 + 100*credit + e`i'*100000
  }
  sum home_damage*
  * Negative values of home_damage can reasonably be interpreted as repairs or improvements made to the home.

  * However, we only have information on insurance payments, and each home has a deductible:
  gen deductible = 5000

  * Each home also has a maximum that the insurance policy will cover:
  gen maximum = `2'

  * Let us first impose our maximums and minimums
  foreach i in 2 1 6 9 {
    gen insurance_claims`i' = min(home_damage`i', maximum)
    * This puts a cap on payouts, but figuring out the minimums is a little trickier.

    * We know that if the claim is less than the deductible then it is not recorded.
    replace insurance_claims`i' = 0 if insurance_claims`i' < deductible
  }
  sum insurance_claims*
  * We can see that the different error distributions affect the payouts, but not by much.
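  * As an extra diagnostic (my addition, not part of the original estimation), count how many observations are bottom- and top-coded under each error distribution:
  foreach i in 2 1 6 9 {
    qui count if insurance_claims`i' == 0
    di "e`i': bottom-coded observations = " r(N)
    qui count if insurance_claims`i' == maximum
    di "e`i': top-coded observations = " r(N)
  }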

*****************************************************************
*** simulation end

* So we want to know, how much did the different factors affect home damages?
* We can observe the insurance claims, the deductibles, and the maximum payout, but not any damages below the deductible or above the maximum.

* remember home_damage`i' = -10000 + 100000*weather + 10000*crime + 5000*occupants - 500*age + 20*age2 + 100*credit + e`i'*100000

* create a return list for the simulation command
gl return_list
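* For example, after the loops below run, $return_list holds entries like OLS_weather2=r(OLS_weather2), which is the expression-list format that simulate expects from an rclass program.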

* Let's see how well the OLS estimator does at recovering the coefficients
  foreach i in 2 1 6 9 {
    * Note that we regress the observed (censored) claims, not the unobservable true damages.
    reg insurance_claims`i' weather crime occupants age age2 credit
    foreach v in weather crime occupants age age2 credit {
      return scalar OLS_`v'`i' = _b[`v']
      gl return_list $return_list OLS_`v'`i'=r(OLS_`v'`i')
    }
  }

  * Let's see how well the tobit estimator does at recovering the coefficients
  foreach i in 2 1 6 9 {
    * The ll() and ul() options tell tobit the lower and upper censoring points.
    tobit insurance_claims`i' weather crime occupants age age2 credit, ll(5000) ul(`2')
    foreach v in weather crime occupants age age2 credit {
      return scalar Tob_`v'`i' = _b[`v']
      gl return_list $return_list Tob_`v'`i'=r(Tob_`v'`i')
    }
  }
* End program
end

tobit_monte_carlo graph 100000

di "simulate $return_list , reps(50): tobit_monte_carlo nograph 100000"

simulate $return_list , reps(50): tobit_monte_carlo nograph 100000

order *weather* *crime* *occup* *age? *age2? *credit*

sum
* It seems that the OLS estimator generally outperforms the tobit estimator.
* There is no obvious reason why this should be the case, except that the data suffer from both top and bottom coding.
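* One way to judge the estimators more directly (a minimal sketch, assuming the simulate results are still in memory): compare the mean estimates of the weather coefficient against its true value of 100000 from the home_damage equation.
foreach i in 2 1 6 9 {
  qui sum OLS_weather`i'
  di "OLS,   e`i': mean = " %9.0f r(mean) ", bias = " %9.0f r(mean)-100000
  qui sum Tob_weather`i'
  di "Tobit, e`i': mean = " %9.0f r(mean) ", bias = " %9.0f r(mean)-100000
}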

* I suspect that if there were only bottom coding then the tobit estimator would outperform the OLS estimator.

* In order to test this we can raise the policy maximum so high that the top coding effectively never binds:
simulate $return_list , reps(50): tobit_monte_carlo nograph 10000000

order *weather* *crime* *occup* *age? *age2? *credit*

sum
* It seems that the OLS estimator still generally outperforms the tobit estimator.
* Though it is hard to say; perhaps this is due to the small sample size.
* A larger sample size should bring the tobit QMLE closer to its asymptotic (consistent) behavior.
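* A sketch of a related check (my addition): more repetitions reduce the Monte Carlo noise in the comparison, though the per-replication sample size itself is hard-coded by the "set obs 3000" line inside the program, which you would need to raise to test the sample-size conjecture directly.
simulate $return_list , reps(200): tobit_monte_carlo nograph 10000000
order *weather* *crime* *occup* *age? *age2? *credit*
sum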
