Friday, August 24, 2012

Central Limit Theorem

* I find the central limit theorem pretty darn amazing. 

* For a series of random draws from any well behaved distribution, the mean of that series of samples from that distribution is normally distributed as the size of those samples gets large.

* This might seem at first like a lot of mumbo jumbo but this result is easily demonstrable and extremely useful.

* To show the central limit theorem (CLT) in action we will use Stata's simulate command.

* First we need to program a "program" to simulate

cap program drop randdraw
program define randdraw


  * This tells stata to call the first argument in the command randdraw N and the second distribution
  args N distribution

  * The first arugment of the
  set obs `N'

  * Generate the variable
  gen x = `distribution'

  sum x


* It is not neccessary that the mean be equal to zero but it might be useful for comparison between different distributions
randdraw 1000 "rpoisson(10)-10"
* This command drew 1000 draws from the poisson distribution less 10.  Which has a E(poisson(10)-10)=E(poisson(10))-10=10-10=0

simulate mean_x =r(mean), reps(1000): randdraw 1000 "rpoisson(10)-10"

hist mean_x, kden
* Looks pretty normal, but that is not so surprising because the poisson distribution begins to look pretty normal once k gets as large as 10.

* So how exactly does the central limit relate to all of this and why is it so cool?

* The argument is that as N the number of draws going into each mean calculation gets large, the distribution of the means begins to look normal.

* Lets first try something as far from normal as possible (not a technical definition)
simulate mean_x =r(mean), reps(1000): randdraw 10 "rbinomial(1,.05)-.05"

* This is the mean of a bernoulli distribution (weighted to .05) either 0 or 1 with each mean representing 10 draws.
hist mean_x

* We can see that this is clearly not normally distributed.  However what happens when we increase the draws to 100 rather than 10?
simulate mean_x =r(mean), reps(1000): randdraw 100 "rbinomial(1,.05)-.05"
hist mean_x

* Already things are beginning to look a lot more like a normal distribution.  Obviously it is not normal because the potential outcomes are discrete.

* Perhaps it will look better at 1000 draws.
simulate mean_x =r(mean), reps(1000): randdraw 1000 "rbinomial(1,.05)-.05"
hist mean_x

* This is what is meant by asymptotically converging on a normal distribution.  As N gets large the distribution of the means begins to look more like a normal distribution (so long as the variance of x is finite).
* Crazy, huh?

* Perhaps it will look better at 10000 draws.
simulate mean_x =r(mean), reps(1000): randdraw 10000 "rbinomial(1,.05)-.05"
hist mean_x, kden

* Still not looking perfect.
simulate mean_x =r(mean), reps(1000): randdraw 100000 "rbinomial(1,.05)-.05"
hist mean_x, kden

* Now things are looking very close to normal.  Of course there will be some randomness in the 1000 draws which have enough inherent variation to make even a normal variable look non-normal.

* This might take a little while but let's increase the draws one more time.
simulate mean_x =r(mean), reps(1000): randdraw 1000000 "rbinomial(1,.05)-.05"
hist mean_x, kden

* So using a x variable with are bernoulli(.05) leads to a convergence in means that looks normal not until around 100,000 draws.  This might sound bad but it really is not.

* Because, usually we don't need the individually drawn means TO BE normal just to be approximated by a normal distribution as well as their standardized values as being approximated by a chi-squared distribution for null tests.

* In addition, a bernoulli(.05) is an extreme distribution with a large skew and discrete outcomes.  Most distributions, even binary and count distributions are going to have properties which are more similar to a normal distribution causing more rapid convergence on normality.

* Even a .5 bernoulli begins to look pretty normal at 1000 draws.
simulate mean_x =r(mean), reps(1000): randdraw 1000 "rbinomial(1,.5)-.5"
hist mean_x, kden

* Let alone distributions such as uniform
simulate mean_x =r(mean), reps(1000): randdraw 1000 "runiform()-.5"
hist mean_x, kden

* Compare this to a normal distribution for which the mean by definition normally distributed (because Normal + Normal ~ Normal)
simulate mean_x =r(mean), reps(1000): randdraw 1000 "rnormal()"
hist mean_x, kden

1 comment: