## Thursday, March 21, 2013

### MSimulate Command

 CLT is a powerful property

* Stata has a wonderfully effective simulate function that allows users to easily simulate data and analysis in a very rapid fashion.

* The only drawback is that when you run it, it will replace the data in memory with the simulated data automatically.

* Which is not a big problem if you stick a preserve in front of your simulate command.

* However, you may want to run sequential simulates and keep the data form all of the simulations together rather than only temporarily accessed.

* Fortunately we can accomplished this task by writing a small program.

cap program drop msim
program define msim

* Gettoken will split the arguments fed into msim into those before colon and those after.
gettoken before after : 0 , parse(":")

* I really like this feature of Stata!

* First let's strip the colon.  The 1 is important since we want to make sure to only remove the first colon.
local simulation = subinstr("after'", ":", "", 1)

* Now what I propose is that the argument in before' is used as an extension for names of variables created by the simulate command.

* First let's save the current data set.

* Generate an id that we will later use to merge in more data
cap gen id = _n

* Save the current data to a temporary location
tempfile tempsave
save tempsave'

* Now we run the simulation which wipes out the current data.
simulation'

* First we will rename all of the variables to have an extension equal to the first argument
foreach v of varlist * {
cap rename v' v'before'
}

* Now we need to generate the ID to merge into

cap gen id = _n

merge 1:1 id using tempsave'
* Get rid of the _merge variable generated from the above command.
drop _merge
end

* Let's write a nice little program that we would like to simulate.
cap program drop simCLT
program define simCLT

clear
set obs 1'
* 1 is defined as the first argument of the program sim

* Let's say we would like to see how many observations we need for the central limit theorem (CLT) to make the means of a bernoulli distribution look normal.  Remember, so long as the mean and variance is defined the generally central limit theorem will eventually force any random distribution of means to approximate a normal distribution as the number of observations gets large.
gen x = rbinomial(1,.25)

sum x
end

* So let's see first how the simulate command works initially

simulate, rep(200): simCLT 100

* The simulate command will automatically save the returns from the sum command as variables (at least in version 12)

hist mean, kden
* The mean is looking good but not normal

* Now normally what we need to do would be to run simulate again with a different argument.

* But instead let's try our new command with 200!

* But instead let's try our new command!
clear
* Clear out the old results
msim 100: simulate, rep(200): simCLT 100

msim 200: simulate, rep(200): simCLT 200

* Looks good!

msim 400: simulate, rep(200): simCLT 400

msim 1000: simulate, rep(200): simCLT 1000

msim 10000: simulate, rep(200): simCLT 10000

msim 100000: simulate, rep(200): simCLT 100000

msim 1000000: simulate, rep(200): simCLT 100000

* The next two commands can take a little while.
msim 10000000: simulate, rep(200): simCLT 1000000

msim 100000000: simulate, rep(200): simCLT 10000000

sum mean*
* We can see that the standard deviations are getting smaller with a larger sample size.

* How is the histograms looking?
foreach v in 100 200 400 100 1000 10000 100000 1000000 10000000 {
hist meanv', name(sv', replace)  nodraw  title(v') kden

}

graph combine s100 s200 s400 s1000 s1000 s10000 s100000 s1000000 s10000000 ///
, title("CLT between 100 and 10,000,000 observations")

* We can see that the distribution of means approximates the normal distribution as the number of draws in each sample gets large.

* This is one of the fundamental findings of statistics and pretty awesome if you think about it.