Thursday, August 16, 2012

Linear models with heteroskedastic errors


* Imagine we have a sample of hotels and prices of rooms in those hotels
clear
set obs 200
set seed 101
* This sets the random seed at 101.  Thus every time this simulation is run it will product identical results.

gen star = ceil(runiform()*7)/2+.5
  label var star "Number of stars of hotel"

gen id = _n

gen het = abs(rnormal())
  label var het "Unobserved heterogeneity."

* Generate a the number of reservations observed per hotel.
gen num_reservations = rpoisson(150)
expand num_reservations

gen seasonality = abs(rnormal())
  label var seasonality "Seasonal demand"

gen v = abs(rnormal())
  label var v "Unobserved variance in the variance term"

gen u = rnormal()*(5+het*15+7.5*star+22.5*seasonality+v*10)
  label var u "Error term"

gen p = 175 + 20*star + 15*seasonality + u + het

sum p
* There are some p values which are less than 0 but we can think of those as special deals, coupons, refunds, or other situations that might result in the effective price being less than 0 dollars.
* If we were to eliminate the less than 0 prices then we would in be enforcing left censoring which is a different problem.  See "tobit".  This blog has several posts touching on the use of the tobit.

* Estimate the price through direct OLS
reg p star seasonality

* Set the panel level observation
xtset id

* Though we have panel data we cannot effectively use fixed effect or random effects approaches to identify the price effect having one more star has on prices.
xtreg p star seasonality, fe
xtreg p star seasonality, re

* Now let's attempt a two step method to more efficient identify the error and reestimate the OLS.

reg p star seasonality
* The OLS regression looks pretty good.  Let's see if we can improve on it.  Note, the 95% confidence interval did not capture the true coefficient of 20 but that is not necessarily a problem.

reg p star seasonality, robust
reg p star seasonality, cluster(id)
* Using robust and cluster robust estimates of the standard error does not change the variance so much that the 95% confidence interval encloses the 20.

predict uhat, resid

gen uhat_abs = abs(uhat)
  label var uhat_abs "Abs of OLS residual"

two  (scatter uhat_abs seasonality) (scatter uhat_abs star), ///
              legend(label(1 "Seasonality") label(2 "Stars"))



* Since E(u)==0 var(u) is equal to u squared Var(u)=(u-E(u))^2
* Likewise, u^2 is approximated by u^2
* Similarly sd(u) should be approximated by (u^2)^.5=abs(u)
reg uhat_abs star seasonality

predict uhat_abs_hat, xb
  * I might be doing this wrong.  I am trying aweights which say something about weighting by the inverse of the variance of the observation.)

gen uhat2 = 1/uhat_abs_hat^2

reg p star seasonality
* Let's see how the unweighted estimate performs

reg p star seasonality [aweight = uhat2]
* Using variance weights does not seem to improve the estimate

* Alternatively we can use the MLE estimator allowing the conditional standard deviation of the error as well as the conditional mean to vary linearly.

cap program drop mle_ols
program mle_ols
  args log_like xb sigma
  qui replace `log_like' = ln(normalden($ML_y1-`xb',0,`sigma'))
end

ml model lf mle_ols (price: p = star seasonality) (sigma: star seasonality)
ml maximize

* Using the MLE estimator we seem to have gained precision in the 3rd decimal place of the coefficients.  The 95% CI still does not enclose the 20 but it is closer.  Still, this is not indicative of a problem.  Perhaps if we simulated this 1000 times and substantially more than 50 times the CI did not enclose the 20 then we might be worried.

No comments:

Post a Comment