Monday, August 27, 2012

Test Rejection Power for Failure of Normality Assumption


* In order for OLS to be the best unbiased estimator (not just the best linear one), the error must be distributed normally (along with the other OLS assumptions holding).

* Failure of the normality assumption does not cause bias, but it may make the estimator less efficient than some non-linear estimators.  However, so long as the error is spherical (homoskedastic and uncorrelated), the OLS estimator is still BLUE (best linear unbiased estimator).

* Likewise, in order to calculate exact p-values and confidence intervals from the t distribution, the underlying error distribution must be normal.

* However, by the central limit theorem, so long as E(error^2) is finite, the OLS estimates are asymptotically normal, so the usual inference becomes approximately valid in large samples.
* Of course, asymptotic approximations sometimes only kick in at very large samples.

* In this post I will explore what happens to confidence intervals when the error is not normally distributed and the sample size is "small".

cap program drop randols
program define randols, rclass

  * This tells Stata to name the first argument of the randols command N and the second distribution
  args N distribution

  * Clear the data and set the number of observations to N
  clear
  set obs `N'

  * Generate the error term u from the specified distribution
  gen u = `distribution'

  gen x = rnormal()

  gen y = 1*x + u

  reg y x

  * The degrees of freedom are N less the number of parameters estimated (2)
  local DOF = `=`N'-2'

  * The 90% confidence interval (10% significance level) is [Bhat - t(.05)*SE, Bhat + t(.05)*SE]
  local lowerCI = _b[x]-invttail(`DOF', .05)*_se[x]
  local upperCI = _b[x]+invttail(`DOF', .05)*_se[x]

  * Now we check whether the true coefficient lies within the CI.
  * Since the true coefficient is 1, we are checking if the CI encloses 1.
  if `lowerCI'<1 & `upperCI'>1 return scalar within = 1
  if !(`lowerCI'<1 & `upperCI'>1) return scalar within = 0
end
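
* (Side note, my addition: the program is defined rclass so that it returns
* the scalar r(within), which the simulate command will collect across
* replications below.)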

randols 100 runiform()
return list

* Now we are set up to do some tests to see how well the 90% CI is working.

* If the CI is too wide (the standard error estimates are too large) then the CI will enclose the true value more than 90% of the time.

* If the CI is too tight then the CI will enclose the true value too few times.
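
* (A hypothetical numeric sketch, my addition, to put magnitudes on this:
* if the standard errors were overstated by 20%, a nominal 90% CI with 98
* degrees of freedom would cover about 95% of the time; understated by 20%,
* only about 83%.)
display 1 - 2*ttail(98, 1.2*invttail(98, .05))
display 1 - 2*ttail(98, invttail(98, .05)/1.2)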

* As a means of comparison let's start with a normal distribution of the error.

* Set the random seed so as to create replicable runs.
simulate r(within), reps(2000) seed(1011) : randols 100 rnormal()
sum

* We can see that the standard error estimates are working well: the 90% CI encloses the true value 90.55% of the time.

* We do not expect exactly 90% since our simulation is a series of random draws as well.
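
* (A side calculation, not in the original post: with 2,000 replications the
* simulation standard error of the coverage estimate is sqrt(.9*.1/2000),
* about .0067, so deviations on the order of half a percentage point from 90%
* are just simulation noise.)
display sqrt(.9*.1/2000)
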
* Now let's try a uniform distribution.  Note that because the uniform has a non-zero mean this will affect the constant estimate, but since we are not interested in the constant we don't mind.
simulate r(within), reps(2000) seed(1011) : randols 100 runiform()
sum
* Likewise, the uniform error distribution does not present a serious problem for estimating the CI.

* Perhaps Bernoulli(.5) errors will be more challenging.
simulate r(within), reps(2000) seed(1011) : randols 100 rbinomial(1,.5)
sum
* Likewise, the normal-based CI is working very well even with Bernoulli(.5) errors.


* Perhaps the more skewed Bernoulli(.15) will be a little more challenging.
simulate r(within), reps(2000) seed(1011) : randols 100 rbinomial(1,.15)
sum
* Nope

* It should not matter if we increase the scale of the error either, since the width of the CI should grow in proportion to the size of the error estimated from the sample.
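
* (A quick one-off check of this scale-invariance claim, my addition: since
* the residuals scale exactly with the error, multiplying the error by 100
* multiplies the estimated standard error on x by exactly 100, leaving the
* t-statistic, and hence coverage, unchanged.)
clear
set obs 100
gen x = rnormal()
gen u = rbinomial(1, .15)
gen y1 = x + u
gen y2 = x + 100*u
qui reg y1 x
local se1 = _se[x]
qui reg y2 x
display _se[x]/`se1'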

* Perhaps scaling the Bernoulli(.15) error by 100 will be more challenging.
simulate r(within), reps(2000) seed(1011) : randols 100 rbinomial(1,.15)*100
sum
* Nope

* Perhaps reducing the sample size will make the non-normal distribution more problematic.
simulate r(within), reps(2000) seed(1011) : randols 10 rbinomial(1,.5)*100
sum

* Even using a sample size of 10 the CI is working beautifully.

* Let's try using a distribution which is extremely problematic:

* Derive the draw by inverting the standard Cauchy CDF:
* CDF(standard Cauchy) = 1/pi * arctan(x) + 1/2
* Set runiform() = 1/pi * arctan(x) + 1/2
* runiform() - 1/2 = 1/pi * arctan(x)
* pi*(runiform() - 1/2) = arctan(x)
* tan(pi*(runiform() - 1/2)) = x
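
* (A sanity check of this inverse-CDF draw, my addition: the standard Cauchy
* has P(|X| > 10) = 2*(1/2 - arctan(10)/pi), so the empirical tail count in
* 100,000 draws should be close to the theoretical one.)
clear
set obs 100000
gen c = tan(_pi*(runiform() - 1/2))
count if abs(c) > 10
display "theoretical count: " 100000*2*(.5 - atan(10)/_pi)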

simulate r(within), reps(2000) seed(1011) : randols 10 tan(_pi*(runiform()-1/2))
sum

* For even the Cauchy distribution (a distribution for which the CLT does not apply, since it has no finite mean or variance) the CI coverage is close to the nominal 90%.
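
* (A brief digression, my addition: the mean of iid standard Cauchy draws is
* itself standard Cauchy at every n, so the running mean never settles down
* no matter how large the sample, which is why the CLT offers no help here.)
clear
set obs 100000
gen c = tan(_pi*(runiform() - 1/2))
gen runmean = sum(c)/_n
list runmean in 100
list runmean in 100000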

* Let's try a larger sample size.

simulate r(within), reps(2000) seed(1011) : randols 1000 tan(_pi*(runiform()-1/2))
sum

* Alright, Cauchy is not enough to throw off our CI.

* How about:

simulate r(within), reps(2000) seed(1011) : randols 1000 1/(rnormal()/10)

* There should be a lot of observations falling near the singularity at 1/0, producing extreme values.

sum
* Yet the CI calculation assuming normally distributed errors continues to work extremely well, with coverage within about .002 of the nominal rate.
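
* (To get a sense of how wild this error distribution actually is, a quick
* look, not in the original post; the detail option shows the extreme tails.)
clear
set obs 10000
gen u = 1/(rnormal()/10)
sum u, detail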

* This is a pretty convincing series of simulations to me.

* If you can conceive of a distribution of errors or a sample size that makes the CI fail badly, please post it as a response to this post!

2 comments:

  1. Nope - normality isn't needed for BLUE; i.e. for the Gauss-Markov Theorem. No specific distribution is used in the proof. If you have normality then the OLS estimator is Best Unbiased (not restricted to best in the class of linear estimators).
