## Sunday, November 4, 2012

### Modeling Heteroskedasticity

Original Code

* Heteroskedasticity is a term most frequently used in economics to refer to errors which have variances that vary in some functional way with observable characteristics.  Heteroskedasticity does not cause bias but it does cause inefficiency (relative to an estimator that controls for heteroskedasticity - if one exists).  What does this mean?  Imagine that you are attempting to model the price of home sales as a function of observable data such as # of rooms, # of bathrooms, yard size, etc.  But there is still a great deal of variation in your outcomes.  You also know a little about the purchasers.  Such as annual income and total assets.  So you want to include this information in your analysis.  You may not think that the income of purchasers does not directly influences the sale price of the home but instead influences the willingness of the purchaser to negotiate (the error is heteroskedastic in purchaser income).  Thus one may argue that the variability of home prices might be a function of purchaser income.

clear
set seed 1221
set obs 1000

* Let's generate a sample data set that has heteroskedasticity in z.
gen x = rnormal()
gen z = runiform()*5

* This means the standard deviation on u is 18 plus 16*z (which has a mean of 2.5)
gen u = ((18+16*z))*rnormal()

gen y = 4*x + u

* Let's imagine that we would like to estimate the effect of x on y.
reg y x

* This effect is small and not statistically significant.  We would like to control for some of heterogeneity in order to get a better estimate.

* We suspect the heterogeneity in the error might be related to z.
predict uhat, resid

scatter u z, sort title(Heteroskedastic Normally Distributed Error)
* This is a classic form of heteroskedasticity.

* There are some two step procedures that can be done to control for this form of heteroskedasticity, such as inverse variance weighting.

* However, these methods have been shown to produce bias (when the result of two step procedures  and are in general less efficient.

* The better method is to estimate the heteroskedasticity simultaneously with the coefficient on x using Maximum Likelihoods.

* I will write a small program to do this.

* Following William Gould, Jeffrey Pitblado, and Brian Poi's book on maximum likelihood in Stata an estimation that allows for explicit modelling of the variance can be defined as follows.

cap program drop myNormalReg
program define myNormalReg

* The first argument of any maximum likelihood program is the name of the temporary log likelihood variable created by the stata when the ml procedure is called.
* Each additional argument is an linear "equation" that Stata maximizes by choosing estimators for.
args lnlk xb sigma
* Thus we have two equations that we are asking Stata to solve.

* The following is the log likelihood value that we would like Stata to maximize.
* In this case, it is the log of the normal density asking Stata to choose the optimal mean and standard deviation.
qui replace lnlk' = ln(normalden($ML_y, xb', sigma')) end * This model is nearly identical to OLS except that is going to be slight variation due to the maximization program finding slightly different peaks. ml model lf myNormalReg (y = x) /sigma ml maximize * By specifying /sigma we are in effect telling stata that we know there is a second equation, that we want only a constant estimated for it and we would like to call it sigma. * An identical expression is as follows: ml model lf myNormalReg (y = x) (sigma:) ml maximize * What is more appealing to us and the reason we wrote the ML program is that now we can easily specify a structural form for sigma. ml model lf myNormalReg (y = x) (sigma: z) ml maximize * Our estimates on the coefficient on x have improved dramatically. This is somewhat of a fluke of the draw however. With a different seed, I found the estimates did not improve uniformly. On average though, we should expect the correctly specified functional form to outperform the OLS model (in terms of efficiency). * Why is that? Without structurally modelling the errors, MLE or OLS are in effect giving equal weight to the size of the error no matter where in the distribution of z the error is occurring. Thus a residual that is a 10 when z is equal to 0 is regarded as equally significant as an error which is 10 when z is 5. However, when looking at the distributions of residuals we can see that an error of 10 when z=0 is much larger proportional to the population of errors than that when z is large. The MLE form correctly (because we specified the data generating process) puts the correct weights on different errors with an error of 10 when z is equal to zero seen as much less likely than when z is equal to 5. * Now for a slight tangent. The following estimator does not work for me. cap program drop myNormalReg2 program define myNormalReg2 args lnlk xb sigma qui replace lnlk' = ln(normalden(($ML_y-xb')/(sigma')))
* The only difference is that I generate the residuals and put those directly into the normal density function.
* By the definition listed in the help file this form and the previous one are identical.
* So what is the problem?  I think it is a result of sigma being the denominator.
* Perhaps it is harder to maximize this problem than the previous, but if so I do not understand why.

* See comment below for the correction posted by Brian Poi (one of the coauthors of the aformentioned book).
end

ml model lf myNormalReg2 (y = x) /sigma
ml maximize

* A final note, simply because the Maximum Likelihood procedure assumes normally distributed errors does not mean that the above program will not provide good estimates when the error is not distributed normally.

clear
set seed 1221
set obs 1000

gen x = rnormal()
gen z = runiform()*5

* Let's instead say that the errors are distributed uniformly (-6,6).  This will generate a standard deviation of 1 since var(uniform) = (a-b)/12 = (6-(-6))/12 = 1
gen u = ((18+16*z))*(runiform()-.5)*(12^.5)

gen y = 4*x + u

reg y x
predict uhat, resid
scatter u z, sort title(Heteroskedastic Uniformly Distributed Error)
* The uniformaty of the error is obvious in the plot as a result of the clearly identifiable edges that characterize the uniform distribution.

ml model lf myNormalReg (y = x) (sigma:)
ml maximize

ml model lf myNormalReg (y = x) (sigma: z)
ml maximize
* We can see that though the error is not normally distributed and though myNormalReg assumes normality.  Overall it is working pretty well, though not as well as when the error is actually normally distributed.

1. Your myNormalReg2 program doesn't work, because your log-likelihood is off by a factor of ln(1/sigma). If you do

qui replace lnlk' = ln(normalden(($ML_y-xb')/sigma')) - ln(sigma') you'll get the right answer. You could also use the two-argument version of normalden() like this: qui replace lnlk' = ln(normalden(($ML_y-xb'), sigma'))

and get the same results.

Brian Poi

1. Thanks Brian, that is helpful.

2. There is one more way of programming the ML estimator that should not be neglected. This is probably the most demonstrative method:

cap program drop myNormalReg3
program define myNormalReg3
args lnlk xb sigma
* Rather than using the built in pdf function, we can specify it directly.
qui replace lnlk' = -ln(sigma' * sqrt(2*_pi)) - (\$ML_y-xb')^2/(2*`sigma'^2)
end

ml model lf myNormalReg3 (y = x) (sigma: )
ml maximize

ml model lf myNormalReg3 (y = x) (sigma: z)
ml maximize

3. In the calculation of the variance/standard deviation of the uniformly distributed variable, the variance is ((b-a)^2)/12 , and so the s.d. is (b-a)/(sqrt(12)), not (b-a)/12.

1. Please be specific about what you are attempting to correct. I believe the variance listed above is exactly what you say it is. For the sd 12^.5 is the same as sqrt(12).

"* Let's instead say that the errors are distributed uniformly (-6,6). This will generate a standard deviation of 1 since var(uniform) = (a-b)/12 = (6-(-6))/12 = 1
gen u = ((18+16*z))*(runiform()-.5)*(12^.5)"

Is there some other part of the code that you found an error in?