CLT is a powerful property |

* Stata has a wonderfully effective simulate function that allows users to easily simulate data and analysis in a very rapid fashion.

* The only drawback is that when you run it, it will replace the data in memory with the simulated data automatically.

* Which is not a big problem if you stick a preserve in front of your simulate command.

* However, you may want to run sequential simulates and keep the data form all of the simulations together rather than only temporarily accessed.

* Fortunately we can accomplished this task by writing a small program.

cap program drop msim

program define msim

* Gettoken will split the arguments fed into msim into those before colon and those after.

gettoken before after : 0 , parse(":")

* I really like this feature of Stata!

* First let's strip the colon. The 1 is important since we want to make sure to only remove the first colon.

local simulation = subinstr("`after'", ":", "", 1)

* Now what I propose is that the argument in `before' is used as an extension for names of variables created by the simulate command.

* First let's save the current data set.

* Generate an id that we will later use to merge in more data

cap gen id = _n

* Save the current data to a temporary location

tempfile tempsave

save `tempsave'

* Now we run the simulation which wipes out the current data.

`simulation'

* First we will rename all of the variables to have an extension equal to the first argument

foreach v of varlist * {

cap rename `v' `v'`before'

}

* Now we need to generate the ID to merge into

cap gen id = _n

merge 1:1 id using `tempsave'

* Get rid of the _merge variable generated from the above command.

drop _merge

end

* Let's write a nice little program that we would like to simulate.

cap program drop simCLT

program define simCLT

clear

set obs `1'

* 1 is defined as the first argument of the program sim

* Let's say we would like to see how many observations we need for the central limit theorem (CLT) to make the means of a bernoulli distribution look normal. Remember, so long as the mean and variance is defined the generally central limit theorem will eventually force any random distribution of means to approximate a normal distribution as the number of observations gets large.

gen x = rbinomial(1,.25)

sum x

end

* So let's see first how the simulate command works initially

simulate, rep(200): simCLT 100

* The simulate command will automatically save the returns from the sum command as variables (at least in version 12)

hist mean, kden

* The mean is looking good but not normal

* Now normally what we need to do would be to run simulate again with a different argument.

* But instead let's try our new command with 200!

* But instead let's try our new command!

clear

* Clear out the old results

msim 100: simulate, rep(200): simCLT 100

msim 200: simulate, rep(200): simCLT 200

* Looks good!

msim 400: simulate, rep(200): simCLT 400

msim 1000: simulate, rep(200): simCLT 1000

msim 10000: simulate, rep(200): simCLT 10000

msim 100000: simulate, rep(200): simCLT 100000

msim 1000000: simulate, rep(200): simCLT 100000

* The next two commands can take a little while.

msim 10000000: simulate, rep(200): simCLT 1000000

msim 100000000: simulate, rep(200): simCLT 10000000

sum mean*

* We can see that the standard deviations are getting smaller with a larger sample size.

* How is the histograms looking?

foreach v in 100 200 400 100 1000 10000 100000 1000000 10000000 {

hist mean`v', name(s`v', replace) nodraw title(`v') kden

}

graph combine s100 s200 s400 s1000 s1000 s10000 s100000 s1000000 s10000000 ///

, title("CLT between 100 and 10,000,000 observations")

* We can see that the distribution of means approximates the normal distribution as the number of draws in each sample gets large.

* This is one of the fundamental findings of statistics and pretty awesome if you think about it.

## No comments :

## Post a Comment