Thursday, March 21, 2013

MSimulate Command

CLT is a powerful property

* Stata has a wonderfully effective simulate function that allows users to easily simulate data and analysis in a very rapid fashion.

* The only drawback is that when you run it, it will replace the data in memory with the simulated data automatically.

* Which is not a big problem if you stick a preserve in front of your simulate command.

* However, you may want to run sequential simulates and keep the data form all of the simulations together rather than only temporarily accessed.

* Fortunately we can accomplished this task by writing a small program.

cap program drop msim
program define msim

  * Gettoken will split the arguments fed into msim into those before colon and those after.
  gettoken before after : 0 , parse(":")

  * I really like this feature of Stata!

  * First let's strip the colon.  The 1 is important since we want to make sure to only remove the first colon.
  local simulation = subinstr("`after'", ":", "", 1)

  * Now what I propose is that the argument in `before' is used as an extension for names of variables created by the simulate command.

  * First let's save the current data set.

  * Generate an id that we will later use to merge in more data
  cap gen id = _n

  * Save the current data to a temporary location
  tempfile tempsave
  save `tempsave'

  * Now we run the simulation which wipes out the current data.

  * First we will rename all of the variables to have an extension equal to the first argument
  foreach v of varlist * {
    cap rename `v' `v'`before'

  * Now we need to generate the ID to merge into

  cap gen id = _n

  merge 1:1 id using `tempsave'
  * Get rid of the _merge variable generated from the above command.
  drop _merge

* Let's write a nice little program that we would like to simulate.
cap program drop simCLT
program define simCLT

  set obs `1'
  * 1 is defined as the first argument of the program sim

  * Let's say we would like to see how many observations we need for the central limit theorem (CLT) to make the means of a bernoulli distribution look normal.  Remember, so long as the mean and variance is defined the generally central limit theorem will eventually force any random distribution of means to approximate a normal distribution as the number of observations gets large.
  gen x = rbinomial(1,.25)

  sum x

* So let's see first how the simulate command works initially

simulate, rep(200): simCLT 100

* The simulate command will automatically save the returns from the sum command as variables (at least in version 12)

hist mean, kden
* The mean is looking good but not normal

* Now normally what we need to do would be to run simulate again with a different argument.

* But instead let's try our new command with 200!

* But instead let's try our new command!
* Clear out the old results
msim 100: simulate, rep(200): simCLT 100

msim 200: simulate, rep(200): simCLT 200

* Looks good!

msim 400: simulate, rep(200): simCLT 400

msim 1000: simulate, rep(200): simCLT 1000

msim 10000: simulate, rep(200): simCLT 10000

msim 100000: simulate, rep(200): simCLT 100000

msim 1000000: simulate, rep(200): simCLT 100000

* The next two commands can take a little while.
msim 10000000: simulate, rep(200): simCLT 1000000

msim 100000000: simulate, rep(200): simCLT 10000000

sum mean*
* We can see that the standard deviations are getting smaller with a larger sample size.

* How is the histograms looking?
foreach v in 100 200 400 100 1000 10000 100000 1000000 10000000 {
  hist mean`v', name(s`v', replace)  nodraw  title(`v') kden


graph combine s100 s200 s400 s1000 s1000 s10000 s100000 s1000000 s10000000 ///
  , title("CLT between 100 and 10,000,000 observations")

* We can see that the distribution of means approximates the normal distribution as the number of draws in each sample gets large.

* This is one of the fundamental findings of statistics and pretty awesome if you think about it.

No comments:

Post a Comment