## Monday, August 27, 2012

### Test Rejection Power for Failure of Normality Assumption

* In order for OLS to be the best unbiased estimator, the errors must be normally distributed (in addition to the other OLS assumptions holding).

* Failure of the normality assumption does not cause bias, but it may make OLS less efficient than some non-linear estimators.  However, so long as the errors are spherical (homoskedastic and uncorrelated), the OLS estimator is still BLUE (best linear unbiased estimator).

* Likewise, in order for the usual p-values and confidence intervals to be exact in finite samples, the underlying error distribution must be normal.

* However, by the central limit theorem, so long as E(error^2) is finite the OLS estimates are asymptotically normal, so the usual inference becomes approximately valid as the sample grows.
* Of course, asymptotic approximations sometimes only become accurate in very large samples.

* In this post I will explore what happens to confidence intervals when the error is not normally distributed and the sample size is "small".

cap program drop randols
program define randols, rclass

* This tells Stata to call the first argument of the randols command `N' and the second `distribution'
args N distribution

* The first argument sets the number of observations
clear
set obs `N'

* Generate the error term from the specified distribution
gen u = `distribution'

gen x = rnormal()

* Generate y with a true coefficient of 1 on x
gen y = 1*x + u

reg y x

* The degrees of freedom are N less the number of parameters estimated (2)
local DOF = `N'-2

* The 90% confidence interval is Bhat-t(.05)*SE, Bhat+t(.05)*SE
local lowerCI = _b[x]-invttail(`DOF', .05)*_se[x]
local upperCI = _b[x]+invttail(`DOF', .05)*_se[x]

* Now we check whether the CI encloses the true coefficient.
* Since the true coefficient on x is 1, we are checking if the CI encloses 1.
if `lowerCI'<1 & `upperCI'>1 return scalar within = 1
if !(`lowerCI'<1 & `upperCI'>1) return scalar within = 0
end

randols 100 runiform()
return list

* Now we are set up to do some tests to see how well the 90% CI is working.

* If the CI is too wide (the standard error estimates are too large) then the CI will enclose the true value more than 90% of the time.

* If the CI is too tight then the CI will enclose the true value too few times.

* As a means of comparison let's start with a normal distribution of the error.

* Set the random seed so as to create replicable runs.
simulate r(within), reps(2000) seed(1011) : randols 100 rnormal()
sum

* We can see that the standard error estimates are working well; the CI encloses the true value 90.55% of the time.

* We do not expect exactly 90% since our simulation is a series of random draws as well.
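* As a rough side check, the Monte Carlo standard error of a coverage estimate
* based on 2000 replications with true coverage .9 is sqrt(.9*.1/2000), or
* about .0067, so an estimate of 90.55% is well within simulation noise of 90%.
di sqrt(.9*.1/2000)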
* Now let's try a uniform distribution.  Note that because the uniform has a non-zero mean, this will affect the constant estimate, but since we are not interested in the constant we don't mind.
simulate r(within), reps(2000) seed(1011) : randols 100 runiform()
sum
* Likewise, the uniform error distribution does not present a serious problem for estimating the CI.

* Perhaps a Bernoulli(.5) error will be more challenging.
simulate r(within), reps(2000) seed(1011) : randols 100 rbinomial(1,.5)
sum
* Likewise, the normal approximation is working very well even with Bernoulli(.5) errors.

* Perhaps the Bernoulli(.15) will be a little more challenging
simulate r(within), reps(2000) seed(1011) : randols 100 rbinomial(1,.15)
sum
* Nope

* It should not matter if we increase the scale of the error either, since the CI should widen in proportion to the size of the error estimated in the underlying population, as sketched below.
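* A rough sketch of why (using illustrative variables y1 and y2, separate from
* the simulations below): multiplying the error by 100 multiplies _se[x] by 100
* as well, leaving the t-ratio and hence the CI coverage unchanged.
clear
set obs 100
gen x = rnormal()
gen u = rbinomial(1,.5)
gen y1 = 1*x + u
gen y2 = 1*x + u*100

reg y1 x
local se1 = _se[x]

reg y2 x
di "se ratio (should be 100): " _se[x]/`se1'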

* Perhaps scaling the Bernoulli(.5) error up by a factor of 100 will be a little more challenging
simulate r(within), reps(2000) seed(1011) : randols 100 rbinomial(1,.5)*100
sum
* Nope

* Perhaps we should reduce the sample size; this should make the non-normal distributions more problematic.
simulate r(within), reps(2000) seed(1011) : randols 10 rbinomial(1,.5)*100
sum

* Even using a sample size of 10 the CI is working beautifully.

* Let's try using a distribution which is extremely problematic:

* CDF(standard Cauchy) = 1/pi * arctan(x) + 1/2
* Setting runiform() = 1/pi * arctan(x) + 1/2 and solving for x:
* (runiform() - 1/2) = 1/pi * arctan(x)
* pi*(runiform() - 1/2) = arctan(x)
* tan(pi*(runiform() - 1/2)) = x
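* As a quick side check that this inverse-CDF draw behaves like a standard
* Cauchy: the median should be near 0 and the 75th percentile near
* tan(_pi/4) = 1.
clear
set obs 100000
gen c = tan(_pi*(runiform()-1/2))
sum c, detail
di "Theoretical 75th percentile: " tan(_pi*(.75-.5))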

simulate r(within), reps(2000) seed(1011) : randols 10 tan(_pi*(runiform()-1/2))
sum

* Even for the Cauchy distribution (a distribution for which the CLT does not apply) the CI coverage is close to the nominal rate.

* Let's try a larger sample size.

simulate r(within), reps(2000) seed(1011) : randols 1000 tan(_pi*(runiform()-1/2))
sum

* Alright, Cauchy is not enough to throw off our CI.

simulate r(within), reps(2000) seed(1011) : randols 1000 1/(rnormal()/10)

* Since many draws of rnormal()/10 fall near zero, this error has a lot of observations near the discontinuity at 1/0 and therefore extremely heavy tails (a quick look at this appears after the summary below).

sum
* Yet the CI approximation assuming normally distributed errors continues to work extremely well, with the rejection rate only .002% off the nominal level.
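* For a quick look at just how heavy-tailed this error is, draw it directly:
* values of rnormal()/10 near zero blow up to enormous values of the error,
* which shows up in the extreme percentiles and the min/max.
clear
set obs 100000
gen e = 1/(rnormal()/10)
sum e, detail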

* This is a pretty convincing series of simulations to me.

* If you can conceive of a distribution of errors / sample size that makes the CI fail badly, please post it as a response to this post!