CLT is a powerful property
* Stata has a wonderfully effective simulate command that lets users simulate data and analysis very quickly.
* The only drawback is that when you run it, it automatically replaces the data in memory with the simulated results.
* That is not a big problem if you put a preserve in front of your simulate command.
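* A minimal sketch of the preserve approach just described. The program demoDraw and the
* use of the auto data set are assumptions made only for illustration.
cap program drop demoDraw
program define demoDraw
clear
set obs 50
gen x = rnormal()
sum x
end
sysuse auto, clear
preserve
simulate mean=r(mean), reps(20): demoDraw
* The simulated means are in memory only until restore is called
sum mean
restore
* After restore the auto data is back in memory and the simulated means are gone.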
* However, you may want to run sequential simulates and keep the data from all of the simulations together rather than only temporarily accessible.
* Fortunately we can accomplish this task by writing a small program.
cap program drop msim
program define msim
* Gettoken will split the arguments fed into msim into those before colon and those after.
gettoken before after : 0 , parse(":")
* I really like this feature of Stata!
* First let's strip the colon. The 1 is important since we want to make sure to only remove the first colon.
local simulation = subinstr("`after'", ":", "", 1)
* Now what I propose is that the argument in `before' is used as an extension for names of variables created by the simulate command.
* First let's save the current data set.
* Generate an id that we will later use to merge in more data
cap gen id = _n
* Save the current data to a temporary location
tempfile tempsave
save `tempsave'
* Now we run the simulation which wipes out the current data.
`simulation'
* First we will rename all of the variables to have an extension equal to the first argument
foreach v of varlist * {
cap rename `v' `v'`before'
}
* Now we need to generate the ID to merge into
cap gen id = _n
merge 1:1 id using `tempsave'
* Get rid of the _merge variable generated from the above command.
drop _merge
end
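* To make the parsing step in msim concrete, here is a small sketch that runs the same
* gettoken and subinstr calls on a literal string. Setting local 0 by hand is only for
* illustration; inside msim, Stata fills it with whatever was typed after the command name.
local 0 "100: simulate, rep(200): simCLT 100"
gettoken before after : 0 , parse(":")
local simulation = subinstr("`after'", ":", "", 1)
* before should hold 100, and simulation should hold the simulate command with its own colon intact
display "before: `before'"
display "simulation: `simulation'"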
* Let's write a nice little program that we would like to simulate.
cap program drop simCLT
program define simCLT
clear
set obs `1'
* `1' is the first argument passed to the program simCLT
* Let's say we would like to see how many observations we need for the central limit theorem (CLT) to make the means of a Bernoulli distribution look normal. Remember, so long as the mean and variance are defined (finite), the central limit theorem says that the distribution of sample means will approximate a normal distribution as the number of observations gets large.
gen x = rbinomial(1,.25)
sum x
end
* So let's first see how the simulate command works on its own
simulate, rep(200): simCLT 100
* The simulate command will automatically save the returns from the sum command as variables (at least in version 12)
hist mean, kden
* The mean is looking good but not normal
* Normally we would now run simulate again with a different argument.
* But instead let's try our new command!
clear
* Clear out the old results
msim 100: simulate, rep(200): simCLT 100
msim 200: simulate, rep(200): simCLT 200
* Looks good!
msim 400: simulate, rep(200): simCLT 400
msim 1000: simulate, rep(200): simCLT 1000
msim 10000: simulate, rep(200): simCLT 10000
msim 100000: simulate, rep(200): simCLT 100000
* The next two commands can take a little while.
msim 1000000: simulate, rep(200): simCLT 1000000
msim 10000000: simulate, rep(200): simCLT 10000000
sum mean*
* We can see that the standard deviations are getting smaller with a larger sample size.
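* As a rough check on those standard deviations: for independent Bernoulli draws with p = .25,
* the standard deviation of the sample mean is sqrt(p*(1-p)/n). The loop below is only a sketch
* that prints this theoretical value for a few sample sizes, to compare with the sum output above.
foreach n in 100 200 400 1000 10000 {
display "n = `n'   theoretical sd of the mean = " %9.6f sqrt(.25*.75/`n')
}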
* How are the histograms looking?
foreach v in 100 200 400 1000 10000 100000 1000000 10000000 {
hist mean`v', name(s`v', replace) nodraw title(`v') kden
}
graph combine s100 s200 s400 s1000 s10000 s100000 s1000000 s10000000 ///
, title("CLT between 100 and 10,000,000 observations")
* We can see that the distribution of means approximates the normal distribution as the number of draws in each sample gets large.
* This is one of the fundamental findings of statistics and pretty awesome if you think about it.
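* Histograms can be hard to judge by eye, so one possible follow-up (not part of the walk-through
* above) is a quantile-normal plot, where points lying close to the diagonal suggest normality.
* This sketch assumes the msim calls above have been run, so that mean100 and mean10000 exist.
qnorm mean100, name(q100, replace) nodraw title(100)
qnorm mean10000, name(q10000, replace) nodraw title(10000)
graph combine q100 q10000, title("Quantile-normal plots of simulated means")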